Backpropagation
How a neural network learns — computing gradients via the chain rule and propagating them backward through every layer.

The big picture

Training has two phases. The forward pass computes a prediction. The backward pass figures out how much each weight was responsible for the error, then nudges every weight in the direction that reduces it.

The key insight: the chain rule from calculus lets us break a complex derivative into a product of simpler ones — each layer only needs to know its own local gradient.

Forward pass — layer by layer

Each layer does the same two operations: a linear transform, then a non-linearity.

z = W·x + b            ← pre-activation (linear)
a = σ(z)               ← post-activation (non-linear)
L = loss(a_final, y)   ← scalar error at the end
Without non-linearities: W3(W2(W1·x)) = W_combined·x — the entire deep network collapses to a single linear model. Depth becomes meaningless.
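The two-operation forward pass can be sketched in a few lines of numpy (sigmoid is assumed here purely for illustration; the layer sizes are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass x through a list of (W, b) layers; cache (z, a) for the backward pass."""
    a, cache = x, []
    for W, b in layers:
        z = W @ a + b        # linear pre-activation
        a = sigmoid(z)       # non-linear post-activation
        cache.append((z, a))
    return a, cache

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # 3 → 4
          (rng.normal(size=(2, 4)), np.zeros(2))]   # 4 → 2
y_hat, cache = forward(rng.normal(size=3), layers)
```

Caching each layer's (z, a) is what makes the backward pass cheap: every local gradient below is a function of these stored values.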

Backward pass — the chain rule

To update a weight W¹ in the first layer, we need ∂L/∂W¹. The chain rule lets us factor this:

∂L/∂W¹ = ∂L/∂a³ · ∂a³/∂z³ · ∂z³/∂a² · ∂a²/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹

Computed RIGHT to LEFT — from loss back toward inputs. Each layer stores its δ (local error signal) and passes it back.

δ_output = ∂L/∂z = (a - y)              ← for MSE loss
δ_hidden = (W_next^T · δ_next) ⊙ σ'(z)  ← propagate backward
∂L/∂W = δ · a_prev^T
∂L/∂b = δ
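A minimal numpy sketch of the δ recursion for a one-hidden-layer net (sigmoid hidden layer, linear output, MSE loss — assumptions chosen so the code matches the δ formulas above term by term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, W1, b1, W2, b2):
    """Gradients of L = ½‖a2 − y‖² for sigmoid hidden layer + linear output."""
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    a2 = W2 @ a1 + b2                         # linear output layer
    delta2 = a2 - y                           # δ_output = ∂L/∂z2 for MSE
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # δ_hidden = (Wᵀ·δ_next) ⊙ σ'(z1)
    return {"W2": np.outer(delta2, a1), "b2": delta2,
            "W1": np.outer(delta1, x),  "b1": delta1}

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
grads = backward(x, y, W1, b1, W2, b2)
```

The sanity check for any backprop implementation is a finite-difference comparison: perturb one weight, re-run the forward pass, and confirm the loss changes at the rate the analytic gradient predicts.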


Vanishing & exploding gradients

Vanishing

Sigmoid/tanh saturate → derivatives near 0 → gradients shrink exponentially through layers → early layers learn nothing.

Fix: ReLU activations, skip connections (ResNets), batch normalization.

Exploding

Gradients grow exponentially — weights blow up, training diverges (NaN loss).

Fix: Gradient clipping, weight initialization, layer norm.

Weight initialization

Method | Formula | Use with | Why
Xavier | W ~ U[-√(6/(nᵢₙ+nₒᵤₜ)), +√(6/(nᵢₙ+nₒᵤₜ))] | Sigmoid, Tanh | Keeps variance stable across layers
He | W ~ N(0, √(2/nᵢₙ)) | ReLU | Accounts for ReLU zeroing half the inputs
Zero | W = 0 | Never! | Symmetry: all neurons identical, never learn
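The two practical rows of the table translate directly into code (the function names are illustrative, not from any library):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # U[-limit, +limit] with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    # N(0, sqrt(2 / n_in)) std — compensates for ReLU zeroing half the inputs
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
W_x = xavier_uniform(300, 100, rng)
W_h = he_normal(256, 256, rng)
```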

Dropout

During training, randomly zero out each neuron with probability p. Forces the network to learn redundant representations — no single neuron gets relied upon.

Train time (inverted dropout):
  mask = Bernoulli(1-p)
  a_drop = a ⊙ mask / (1-p)   ← scale up so expected value stays the same
Test time:
  a_out = a                   ← use all neurons as-is, no scaling needed
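The train/test asymmetry above, as a minimal numpy sketch:

```python
import numpy as np

def dropout(a, p, rng, train=True):
    """Inverted dropout: zero each unit with prob p, rescale to keep E[a] fixed."""
    if not train:
        return a                      # test time: no-op, no scaling needed
    mask = rng.random(a.shape) >= p   # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)       # scale survivors up by 1/(1-p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
a_drop = dropout(a, p=0.5, rng=rng)
# the mean of a_drop stays close to 1.0 — the expected value is preserved
```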
Optimization Algorithms
Six algorithms that solve the same problem — update weights to minimize loss — but differ in how they use gradient history.

Side-by-side comparison

Optimizer | Extra memory | Adaptive LR | Best for | Pitfall
SGD | None | No | Vision with tuning | Slow, oscillates in ravines
Momentum | v | No | Convex + noisy loss | Can overshoot
NAG | v | No | Slightly better than momentum | More complex to implement
Adagrad | G (sum sq.) | Yes | Sparse NLP features | LR decays to zero
RMSProp | E[g²] (moving avg.) | Yes | RNNs, non-stationary | No bias correction
Adam | m, v | Yes | Default choice, NLP | May generalize worse than SGD

Gradient descent variants

Batch GD

Full dataset per update. Stable gradient estimate but extremely slow. Not practical for large datasets.

SGD

One sample per update. Noisy but fast — the noise helps escape shallow local minima. In practice, "SGD" almost always refers to mini-batch SGD.

Mini-batch

Best of both worlds — batch size 32–256. GPU-parallelizable. Standard in practice.

Learning rate scheduling

Step decay:   lr = lr₀ × γ^⌊epoch/step_size⌋
Cosine decay: lr = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))
Warmup:       lr increases linearly for k steps, then decays
OneCycleLR:   warmup → peak → cosine decay (fast convergence)
Warmup is critical when fine-tuning transformers — a large learning rate at the start will destroy pre-trained weights.
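The schedules above as plain functions (a sketch; `warmup_cosine`, which chains linear warmup into cosine decay, is an illustrative composition rather than any library's API):

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    # lr0 * gamma^floor(epoch / step_size)
    return lr0 * gamma ** (epoch // step_size)

def cosine_decay(lr_min, lr_max, t, T):
    # smooth anneal from lr_max (t=0) down to lr_min (t=T)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_cosine(lr_max, t, warmup, T):
    # linear warmup for `warmup` steps, then cosine decay to zero
    if t < warmup:
        return lr_max * t / warmup
    return cosine_decay(0.0, lr_max, t - warmup, T - warmup)
```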
Convolutional Neural Networks
CNNs exploit the spatial structure of images using weight sharing, local connectivity, and hierarchical feature learning.

Why convolutions?

Weight sharing

The same 3×3 filter slides across the entire image. An edge detector learned in the top-left corner works everywhere. Vastly fewer parameters than a dense layer.

Hierarchical features

Layer 1 detects edges. Layer 2 detects textures. Layer 3 detects object parts. Layer N detects whole objects. Depth builds abstraction.

Conv layer math

Output size = ⌊(W - F + 2P) / S⌋ + 1
  W = input size, F = filter size, P = padding, S = stride
Parameters = F × F × C_in × C_out + C_out (biases)
Example: 3×3 conv, 64 → 128 channels = 3×3×64×128 + 128 = 73,856 params
A dense layer connecting the same feature maps (64·H·W inputs to 128·H·W outputs) would need millions of parameters.
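Both formulas are easy to sanity-check in code:

```python
def conv_out_size(W, F, P, S):
    # floor((W - F + 2P) / S) + 1
    return (W - F + 2 * P) // S + 1

def conv_params(F, c_in, c_out):
    # F*F weights per input-output channel pair, plus one bias per output channel
    return F * F * c_in * c_out + c_out

# the worked example above: 3x3 conv from 64 to 128 channels
assert conv_params(3, 64, 128) == 73856
# "same" padding: 224-pixel input, 3x3 filter, padding 1, stride 1 keeps size
assert conv_out_size(224, 3, 1, 1) == 224
```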

Batch normalization

x_norm = (x - μ_batch) / √(σ²_batch + ε)
y = γ · x_norm + β   ← learned scale and shift per channel

Benefits: faster training, acts as regularizer, allows higher learning rates, reduces sensitivity to initialization. Applied before or after activation (before is more common now).
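A minimal numpy sketch of the normalization step (training-time batch statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the batch axis, then apply learned scale/shift per channel."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 8))   # batch of 64, 8 channels, shifted/scaled
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# per-channel mean is now ~0 and std ~1 regardless of the input distribution
```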

Feature visualization techniques

Technique | What it shows | How
Filter viz | What pattern a filter detects | Display learned conv1 weight kernels directly
Activation maps | Where a filter fires on an input | Forward pass → visualize intermediate feature map
Grad-CAM | Where the model looked for its decision | Gradients into last conv layer → weighted spatial map
Activation max | What input maximally excites a neuron | Backprop into input pixels, gradient ascent
SHAP | Per-pixel attribution | Perturb patches, measure output change

Grad-CAM explained

1. Forward pass — store last conv layer activations A^k (H×W×K)
2. Backprop — compute gradients ∂y_c/∂A^k for class c
3. Global average pool the gradients: α_k = (1/HW) Σᵢⱼ ∂y_c/∂A^k_ij ← importance of channel k
4. Weighted combination + ReLU: L_CAM = ReLU(Σ_k α_k · A^k) ← heatmap (H×W)
5. Upsample heatmap to input resolution and overlay
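Steps 3–4 reduce to a few array operations once the activations and gradients are in hand; a numpy sketch (forward/backward passes and upsampling omitted, shapes illustrative):

```python
import numpy as np

def grad_cam(acts, grads):
    """acts, grads: (H, W, K) last-conv activations and ∂y_c/∂acts.
    Returns an (H, W) heatmap — steps 3-4 above, upsampling omitted."""
    alpha = grads.mean(axis=(0, 1))                   # global-average-pool gradients
    cam = np.tensordot(acts, alpha, axes=([2], [0]))  # weighted channel combination
    return np.maximum(cam, 0.0)                       # ReLU keeps positive evidence

rng = np.random.default_rng(0)
heat = grad_cam(rng.normal(size=(7, 7, 512)), rng.normal(size=(7, 7, 512)))
```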

Fine-tuning pre-trained models

1. Load a pre-trained model

Load ResNet/EfficientNet/ViT pre-trained on ImageNet. It has learned rich features: edges → textures → object parts → objects across 1.2M images.

2. Replace the classification head

Remove the final FC layer (1000 ImageNet classes). Add a new FC layer for your number of classes, randomly initialized.

3. Freeze early layers

Freeze layers 1–N. Only the new head is trainable. Early layers detect generic features that transfer — no need to relearn them.

4. Train with a small LR

Use LR ≈ 1e-4 or smaller. Use discriminative learning rates — tiny LR for early layers, larger for the new head.

5. Progressive unfreezing

After convergence, unfreeze more layers and continue with an even smaller LR. Repeat until fully unfrozen if needed.

Normalize inputs with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] — pre-trained models expect this exact distribution.

Classic architectures

Model | Key innovation | Params
AlexNet (2012) | First deep CNN to win ImageNet; introduced ReLU + dropout | 60M
VGGNet (2014) | Only 3×3 convs stacked deep — depth matters | 138M
ResNet (2015) | Skip connections — enables 100+ layer training | 25M (50-layer)
EfficientNet (2019) | Compound scaling of width + depth + resolution jointly | 5–66M
ViT (2020) | Pure transformer on image patches — no convolutions | 86M (Base)
RNN / LSTM / GRU & Attention
Sequence models that maintain state across time steps — essential for text, speech, time series, and transliteration.

Vanilla RNN

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y

h_t is the "memory" — it carries information from all previous steps.
Problem: gradients are multiplied by W_hh at EVERY step. If the eigenvalues of W_hh are < 1 → gradients shrink exponentially → vanishing.
An RNN with 50 time steps must backprop through 50 matrix multiplications. With W_hh having eigenvalues < 1, the gradient at step 1 is W_hh^50 ≈ 0. The network forgets anything beyond ~10 steps.
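A toy numpy demonstration of that decay — the 0.9-scaled orthogonal W_hh is a contrived choice so that every backprop step shrinks the gradient norm by exactly 0.9:

```python
import numpy as np

# An orthogonal matrix scaled by 0.9 has every singular value (and |eigenvalue|)
# equal to 0.9, so repeated multiplication decays the norm at exactly 0.9/step.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
W_hh = 0.9 * Q

g = np.ones(32)                  # gradient arriving at the final time step
for _ in range(50):
    g = W_hh.T @ g               # one step of backprop through time
# after 50 steps the norm is 0.9**50 ≈ 0.005 of its starting value
```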

LSTM — Long Short-Term Memory

Introduces a cell state C_t — a highway for gradients that flows with only element-wise operations (no matrix multiply). Three gates control information flow:

Forget gate:  f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    ← what to erase from C
Input gate:   i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    ← what to write to C
Candidate:    C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) ← what to write
Cell update:  C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t        ← update cell state
Output gate:  o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    ← what to expose
Hidden state: h_t = o_t ⊙ tanh(C_t)
The cell state C_t flows through with only ⊙ (element-wise multiply) and + operations — no weight matrix in the gradient path. This is why gradients don't vanish.
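One LSTM time step in numpy, following the gate equations verbatim (the `params` dict and weight shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step; all weights act on the concatenation [h_prev, x]."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(params["Wf"] @ hx + params["bf"])        # forget gate
    i = sigmoid(params["Wi"] @ hx + params["bi"])        # input gate
    C_tilde = np.tanh(params["Wc"] @ hx + params["bc"])  # candidate
    C = f * C_prev + i * C_tilde      # cell update: elementwise ops only
    o = sigmoid(params["Wo"] @ hx + params["bo"])        # output gate
    h = o * np.tanh(C)
    return h, C

H, X = 8, 4
rng = np.random.default_rng(0)
params = {f"W{k}": rng.normal(size=(H, H + X)) for k in "fico"}
params.update({f"b{k}": np.zeros(H) for k in "fico"})
h, C = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), params)
```

Note that the line updating C contains no matrix multiply — only ⊙ and + — which is exactly the gradient highway described above.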

GRU — Gated Recurrent Unit

Simplified LSTM: merges forget + input into one update gate, and merges cell state + hidden state. Fewer parameters, often matches LSTM:

Reset gate:  r_t = σ(W_r · [h_{t-1}, x_t])
Update gate: z_t = σ(W_z · [h_{t-1}, x_t])
Candidate:   h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
New state:   h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t   ← interpolate old and new

Use LSTM when

Long sequences with complex dependencies. Large datasets where extra parameters help. Need separate cell and hidden state control.

Use GRU when

Smaller datasets. Faster training needed. Simpler dependencies. Usually competitive with LSTM at lower computational cost.

Seq2Seq for transliteration

Encoder: reads source chars x₁...xₙ → final hidden state h_n
Decoder: generates target chars y₁...yₘ using h_n

The entire source sequence is compressed into ONE fixed-size vector. This is the bottleneck — works for short sequences, fails for long ones.

Attention mechanism

Instead of one fixed context vector, the decoder attends to ALL encoder hidden states at each decoding step:

score(h_s, h_t) = h_t^T · W_a · h_s   ← alignment score (additive variant: v·tanh(W₁h_s + W₂h_t))
α_{ts} = softmax(score) over all s    ← attention weights (sum to 1)
context_t = Σ_s α_{ts} · h_s          ← weighted sum of encoder states
output_t = decoder(context_t, h_{t-1}, y_{t-1})
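A single decoding step of multiplicative attention in numpy (shapes are illustrative; the decoder itself is omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attend(enc_states, h_t, W_a):
    """enc_states: (S, d) encoder states, h_t: (d,) decoder state, W_a: (d, d)."""
    scores = enc_states @ W_a @ h_t   # alignment score per source position
    alpha = softmax(scores)           # attention weights, sum to 1
    context = alpha @ enc_states      # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 16))        # 6 source positions, hidden size 16
ctx, alpha = attend(enc, rng.normal(size=16), rng.normal(size=(16, 16)))
```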
For transliteration, attention learns near-monotonic alignment but handles 1-to-many character mappings (one Hindi character → 2 Latin chars). This directly improved the BLEU score.

Types of attention

Type | Score function | Notes
Dot-product | h_s · h_t | Fast, requires same dimension
Scaled dot-product | (h_s · h_t) / √d_k | Used in Transformers — prevents softmax saturation
Additive (Bahdanau) | v^T tanh(W₁h_s + W₂h_t) | Original attention paper, learned alignment
Multi-head | Concat of h scaled dot-product heads | Transformers — attends to multiple positions simultaneously
Naive Bayes
A probabilistic classifier based on Bayes' theorem with one bold assumption: all features are conditionally independent given the class label.

Bayes' theorem

P(y | x) = P(x | y) · P(y) / P(x)
Posterior = Likelihood × Prior / Evidence

For classification we compare classes, so P(x) is constant — we can ignore it:
argmax_y P(y | x) = argmax_y P(x | y) · P(y)

The naive assumption

P(x₁, x₂, ..., xₙ | y) = P(x₁|y) · P(x₂|y) · ... · P(xₙ|y). Features are assumed independent given the class. Almost never true in reality — but works surprisingly well, especially for text classification.

Training — what we estimate from data

Prior:      P(y=c) = count(y=c) / N
Likelihood: P(xᵢ=v|y=c) = count(xᵢ=v in class c) / count(class c)

For continuous features → Gaussian NB:
P(xᵢ|y=c) = (1/√(2πσ²_ic)) · exp(-(xᵢ-μ_ic)² / (2σ²_ic))
Estimate μ_ic and σ²_ic per feature per class from the training data.

Laplace smoothing — handle unseen values

Problem: if a word never appeared in class c during training, P(word|c) = 0.
Then the entire product P(x|c) = 0 — one unseen word kills the prediction.

P(xᵢ=v | y=c) = (count(xᵢ=v, y=c) + α) / (count(y=c) + α · |V|)

α = 1 → Laplace smoothing. Any α > 0 → no zero probabilities ever.

Log probabilities — avoid underflow

Problem: multiplying 1000 probabilities like 0.001 × 0.003 × ... underflows to 0 in floating point.

log P(y|x) ∝ log P(y) + Σᵢ log P(xᵢ|y)

Products become sums of logs — numerically stable, same argmax result.
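A quick demonstration of why the log trick matters:

```python
import numpy as np

# 1000 small likelihoods: the raw product underflows to exactly 0.0,
# while the log-domain sum stays finite and usable for argmax comparisons.
probs = np.full(1000, 1e-3)
naive_product = np.prod(probs)      # 1e-3000 is far below float64's range
log_sum = np.sum(np.log(probs))     # = 1000 · log(1e-3) ≈ -6907.8
```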

Implementation from scratch

import numpy as np

class NaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.vars[c] = X_c.var(axis=0)

    def _log_likelihood(self, x, mean, var):
        # Gaussian log-density per feature, summed (the naive independence step)
        log_exp = -((x - mean) ** 2) / (2 * var + 1e-9)
        log_norm = -0.5 * np.log(2 * np.pi * var + 1e-9)
        return np.sum(log_norm + log_exp)

    def predict(self, X):
        preds = []
        for x in X:
            scores = {c: np.log(self.priors[c]) +
                         self._log_likelihood(x, self.means[c], self.vars[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds

Variants

Variant | Feature type | Likelihood model | Best use case
Gaussian NB | Continuous | Gaussian distribution | Real-valued features (sepal length, etc.)
Multinomial NB | Counts / frequencies | Multinomial distribution | Text classification (word counts, TF-IDF)
Bernoulli NB | Binary (present/absent) | Bernoulli distribution | Short docs, binary bag-of-words
Complement NB | Counts | Complement class stats | Imbalanced text datasets
AdaBoost
Adaptive Boosting: train a sequence of weak learners where each one focuses on the mistakes of its predecessors, then combine with weighted vote.

Core intuition

A "weak learner" only needs to be slightly better than random (>50% accuracy). AdaBoost amplifies the weights of misclassified samples so each new learner is forced to focus on the hard cases.

Algorithm — full walkthrough

Initialize: wᵢ = 1/N for all i = 1...N
for t = 1 to T:
  a) Train weak learner h_t on weighted training data
  b) Weighted error:  εₜ = Σᵢ wᵢ · 𝟙[h_t(xᵢ) ≠ yᵢ]
  c) Learner weight:  αₜ = ½ · ln((1 - εₜ) / εₜ)   ← large if εₜ→0, zero if εₜ=0.5
  d) Update weights:  wᵢ ← wᵢ · exp(-αₜ · yᵢ · h_t(xᵢ))   ← grows for misclassified, shrinks for correct
  e) Normalize:       wᵢ ← wᵢ / Σⱼ wⱼ
Final: H(x) = sign(Σₜ αₜ · h_t(x))   ← weighted majority vote

Understanding the learner weight α

When ε → 0 (near-perfect learner): α → ∞ — huge weight, dominates the ensemble. When ε = 0.5 (random guess): α = 0 — completely ignored. When ε > 0.5: α < 0 — flip its predictions!
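The three regimes as a one-liner:

```python
import math

def learner_weight(eps):
    """AdaBoost learner weight: alpha = 0.5 · ln((1 - eps) / eps)."""
    return 0.5 * math.log((1 - eps) / eps)

# eps near 0   → large positive alpha (learner dominates the vote)
# eps = 0.5    → alpha = 0 (random guesser is ignored)
# eps above 0.5 → negative alpha (its vote is flipped)
```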

Weight update explained

Correctly classified (yᵢ·h_t(xᵢ) = +1)

exp(-αₜ) < 1 → weight shrinks. This sample gets less attention in the next round.

Misclassified (yᵢ·h_t(xᵢ) = -1)

exp(+αₜ) > 1 → weight grows. Next learner is forced to focus here.

Implementation from scratch

import numpy as np

class DecisionStump:
    def fit(self, X, y, weights):
        # exhaustive search over (feature, threshold, polarity) minimizing weighted error
        best = {'err': float('inf')}
        for feat in range(X.shape[1]):
            for thresh in np.unique(X[:, feat]):
                for polarity in [1, -1]:
                    preds = polarity * np.sign(X[:, feat] - thresh)
                    err = np.sum(weights[preds != y])
                    if err < best['err']:
                        best = {'err': err, 'feat': feat,
                                'thresh': thresh, 'pol': polarity}
        self.__dict__.update(best)

    def predict(self, X):
        return self.pol * np.sign(X[:, self.feat] - self.thresh)

class AdaBoost:
    def fit(self, X, y, T=50):
        n = len(y)
        w = np.ones(n) / n                   # start with uniform sample weights
        self.stumps, self.alphas = [], []
        for _ in range(T):
            stump = DecisionStump()
            stump.fit(X, y, w)
            preds = stump.predict(X)
            eps = np.sum(w[preds != y]) + 1e-10
            alpha = 0.5 * np.log((1 - eps) / eps)
            w *= np.exp(-alpha * y * preds)  # up-weight mistakes, down-weight hits
            w /= w.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)

    def predict(self, X):
        return np.sign(sum(a * s.predict(X)
                           for a, s in zip(self.alphas, self.stumps)))

Properties

Training error

Decreases exponentially with T. Provably converges to zero training error if each weak learner beats random chance.

Sensitivity

Resistant to overfitting in practice. But very sensitive to noisy labels and outliers — they get amplified exponentially.

Loss Functions
The loss defines what the model optimizes for. Choosing the wrong one can cripple training even with a perfect architecture.

Worked comparison — regression losses

For a single prediction error of 2.0: MSE = 4.000, MAE = 2.000, Huber (δ=1) = 1.500.

Implementation

import numpy as np

# --- Regression losses ---
def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def huber(y, yhat, d=1):
    # quadratic for small errors, linear for large ones
    e = y - yhat
    return np.where(np.abs(e) <= d, 0.5 * e**2, d * (np.abs(e) - 0.5 * d)).mean()

# --- Classification losses ---
def bce(y, p):
    p = np.clip(p, 1e-9, 1 - 1e-9)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, logits):
    p = softmax(logits)
    return -np.sum(y_onehot * np.log(p + 1e-9))

def hinge(y, s):                     # y in {-1, +1}
    return np.mean(np.maximum(0, 1 - y * s))

def kl_div(p, q):
    return np.sum(p * np.log(p / (q + 1e-9)))

Gradient cheat sheet

Loss | Gradient ∂L/∂ŷ | Key property
MSE | 2(ŷ - y) / n | Large errors → large gradient
MAE | sign(ŷ - y) / n | Constant magnitude, sparse
BCE | -y/ŷ + (1-y)/(1-ŷ) | With sigmoid: simplifies to ŷ - y
CE + Softmax | ŷ - y_onehot | Elegantly simple — just predictions minus labels
Hinge | -y if y·s < 1, else 0 | Zero gradient for correctly classified (sparse)
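The "CE + Softmax" row is easy to verify numerically — the analytic gradient ŷ − y_onehot matches a finite-difference estimate (a sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ce(logits, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(logits)))

# analytic gradient from the table: predictions minus labels
logits = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])
analytic = softmax(logits) - y

# central finite differences, one logit at a time
eps = 1e-6
numeric = np.array([
    (ce(logits + eps * np.eye(3)[k], y) - ce(logits - eps * np.eye(3)[k], y))
    / (2 * eps)
    for k in range(3)
])
```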