How a neural network learns — computing gradients via the chain rule and propagating them backward through every layer.
The big picture
Training has two phases. The forward pass computes a prediction. The backward pass figures out how much each weight was responsible for the error, then nudges every weight in the direction that reduces it.
The key insight: the chain rule from calculus lets us break a complex derivative into a product of simpler ones — each layer only needs to know its own local gradient.
Forward pass — layer by layer
Each layer does the same two operations: a linear transform, then a non-linearity.
z = W·x + b          ← pre-activation (linear)
a = σ(z)             ← post-activation (non-linear)
L = loss(a_final, y) ← scalar error at the end
Without non-linearities: W3(W2(W1·x)) = W_combined·x — the entire deep network collapses to a single linear model. Depth becomes meaningless.
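A quick NumPy check of the collapse — a minimal sketch with random matrices (the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((3, 4))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(5)

deep = W3 @ (W2 @ (W1 @ x))       # three stacked linear "layers"
combined = (W3 @ W2 @ W1) @ x     # one equivalent single linear layer
assert np.allclose(deep, combined)
```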
Backward pass — the chain rule
To update a weight W¹ in the first layer, we need ∂L/∂W¹. The chain rule lets us factor this:
∂L/∂W¹ = ∂L/∂a³ · ∂a³/∂z³ · ∂z³/∂a² · ∂a²/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹
Computed RIGHT to LEFT — from loss back toward inputs.
Each layer stores its δ (local error signal) and passes it back.
δ_output = ∂L/∂z = (a - y) ← for MSE loss
δ_hidden = (W_next^T · δ_next) ⊙ σ'(z) ← propagate backward
∂L/∂W = δ · a_prev^T
∂L/∂b = δ
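The four δ equations above can be sketched for a tiny two-layer network — sigmoid hidden layer, linear output, MSE loss ½(a−y)², so δ_output = (a − y) holds exactly. Shapes and variable names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))          # one input sample
y = np.array([[1.0]])                    # target
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2                        # linear output layer

# Backward pass — exactly the δ equations above
delta2 = z2 - y                          # δ_output = (a - y) for MSE
delta1 = (W2.T @ delta2) * a1 * (1 - a1) # δ_hidden = (W_nextᵀ·δ_next) ⊙ σ'(z)
dW2, db2 = delta2 @ a1.T, delta2         # ∂L/∂W = δ·a_prevᵀ,  ∂L/∂b = δ
dW1, db1 = delta1 @ x.T, delta1
```

A finite-difference check on any single weight confirms these analytic gradients.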
Vanishing & exploding gradients
Vanishing
Sigmoid/tanh saturate → derivatives near 0 → gradients shrink exponentially through layers → early layers learn nothing.
Exploding
Derivatives/weights > 1 → gradients grow exponentially through layers → unstable updates. Common fix: gradient clipping.
Dropout
During training, randomly zero out each neuron with probability p. Forces the network to learn redundant representations — no single neuron gets relied upon.
Train time (inverted dropout):
mask = Bernoulli(1-p)
a_drop = a ⊙ mask / (1-p) ← scale up so expected value stays the same
Test time:
a_out = a ← use all neurons as-is, no scaling needed
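Inverted dropout in NumPy — a minimal sketch; the array sizes and drop probability are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(10_000)    # activations of one layer
p = 0.5                            # drop probability

# Train time: zero with probability p, scale survivors by 1/(1-p)
mask = rng.random(a.shape) >= p    # Bernoulli(1-p) keep-mask
a_drop = a * mask / (1 - p)

# Test time: use `a` unchanged — the 1/(1-p) scaling already
# made E[a_drop] match E[a], so no correction is needed.
```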
Optimization Algorithms
Six algorithms that solve the same problem — update weights to minimize loss — but differ in how they use gradient history.
Side-by-side comparison
| Optimizer | Extra memory | Adaptive LR | Best for | Pitfall |
|---|---|---|---|---|
| SGD | None | No | Vision with tuning | Slow, oscillates in ravines |
| Momentum | v | No | Convex + noisy loss | Can overshoot |
| NAG | v | No | Slightly better than momentum | More complex to implement |
| Adagrad | G (sum sq.) | Yes | Sparse NLP features | LR decays to zero |
| RMSProp | v² | Yes | RNNs, non-stationary | No bias correction |
| Adam | m, v | Yes | Default choice, NLP | May generalize worse than SGD |
Gradient descent variants
Batch GD
Full dataset per update. Stable gradient estimate but extremely slow. Not practical for large datasets.
SGD
One sample per update. Noisy but fast. Noise helps escape local minima. Still the term used for mini-batch in practice.
Mini-batch
Best of both worlds — batch size 32–256. GPU-parallelizable. Standard in practice.
Learning rate scheduling
Step decay: lr = lr₀ × γ^⌊epoch/step_size⌋
Cosine decay: lr = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))
Warmup: lr increases linearly for k steps, then decays
OneCycleLR: warmup → peak → cosine decay (fast convergence)
Warmup is critical when fine-tuning transformers — a large learning rate at the start will destroy pre-trained weights.
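The schedule formulas above can be sketched as plain functions — a minimal illustration; the parameter names (`lr0`, `gamma`, `warmup`, …) are my own choices, not a specific library's API:

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """lr = lr0 · γ^⌊epoch/step_size⌋"""
    return lr0 * gamma ** (epoch // step_size)

def cosine_decay(lr_min, lr_max, t, T):
    """lr = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))"""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(lr_max, t, warmup, T):
    """Linear warmup for `warmup` steps, then cosine decay to 0."""
    if t < warmup:
        return lr_max * (t + 1) / warmup
    return cosine_decay(0.0, lr_max, t - warmup, T - warmup)
```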
Convolutional Neural Networks
CNNs exploit the spatial structure of images using weight sharing, local connectivity, and hierarchical feature learning.
Why convolutions?
Weight sharing
The same 3×3 filter slides across the entire image. An edge detector learned in the top-left corner works everywhere. Vastly fewer parameters than a dense layer.
Output size = ⌊(W - F + 2P) / S⌋ + 1
W = input size, F = filter size, P = padding, S = stride
Parameters = F × F × C_in × C_out + C_out (biases)
Example: 3×3 conv, 64 → 128 channels
= 3×3×64×128 + 128 = 73,856 params
Dense equivalent: a fully connected layer between the same flattened feature maps would need (64·H·W) × (128·H·W) weights — millions of parameters.
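Both formulas as one-liners, checked against the worked example above:

```python
def conv_output_size(W, F, P=0, S=1):
    """⌊(W - F + 2P) / S⌋ + 1"""
    return (W - F + 2 * P) // S + 1

def conv_params(F, C_in, C_out):
    """F×F weights per in/out channel pair, plus one bias per output channel."""
    return F * F * C_in * C_out + C_out

# Worked example from above: 3×3 conv, 64 → 128 channels
assert conv_params(3, 64, 128) == 73_856
# 224×224 input, 3×3 filter, padding 1, stride 1 → spatial size preserved
assert conv_output_size(224, 3, P=1, S=1) == 224
```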
Batch normalization
x_norm = (x - μ_batch) / √(σ²_batch + ε)
y = γ · x_norm + β ← learned scale and shift per channel
Benefits: faster training, acts as regularizer, allows higher learning rates, reduces sensitivity to initialization. Applied before or after activation (before is more common now).
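A forward-pass sketch of the two equations above for a (N, C) batch — training-mode statistics only; running averages for inference are omitted, and the input distribution is an arbitrary example:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, C) — normalize each channel over the batch dimension."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta          # learned scale and shift per channel

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(256, 4))   # badly shifted/scaled activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
# out has per-channel mean ≈ 0 and std ≈ 1
```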
Feature visualization techniques
| Technique | What it shows | How |
|---|---|---|
| Filter viz | What pattern a filter detects | Display learned conv1 weight kernels directly |
| Activation maps | Where a filter fires on an input | Forward pass → visualize intermediate feature map |
| Grad-CAM | Where model looked for its decision | Gradients into last conv layer → weighted spatial map |
| Activation max | What input maximally excites a neuron | Backprop into input pixels, gradient ascent |
| SHAP | Per-pixel attribution | Perturb patches, measure output change |
Grad-CAM explained
1. Forward pass — store last conv layer activations A^k (H×W×K)
2. Backprop — compute gradients ∂y_c/∂A^k for class c
3. Global average pool the gradients:
   α_k = (1/HW) Σᵢⱼ ∂y_c/∂A^k_ij ← importance of channel k
4. Weighted combination + ReLU:
   L_CAM = ReLU(Σ_k α_k · A^k) ← heatmap (H×W)
5. Upsample heatmap to input resolution and overlay
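Steps 3–4 can be sketched in NumPy, with random arrays standing in for real activations and gradients (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 16, 7, 7
A = rng.random((K, H, W))               # step 1: last conv activations A^k
grads = rng.standard_normal((K, H, W))  # step 2: ∂y_c/∂A^k for class c

alpha = grads.mean(axis=(1, 2))         # step 3: global average pool → (K,)
cam = np.maximum(0, np.tensordot(alpha, A, axes=1))  # step 4: ReLU(Σ α_k·A^k) → (H, W)
```

The resulting (H, W) map would then be upsampled to the input resolution (step 5).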
Fine-tuning pre-trained models
1
Load pre-trained model
Load ResNet/EfficientNet/ViT pre-trained on ImageNet. It has learned rich features: edges → textures → object parts → objects across 1.2M images.
2
Replace the classification head
Remove the final FC layer (1000 ImageNet classes). Add a new FC layer for your number of classes, randomly initialized.
3
Freeze early layers
Freeze layers 1–N. Only the new head is trainable. Early layers detect generic features that transfer — no need to relearn them.
4
Train with small LR
Use LR ≈ 1e-4 or smaller. Use discriminative learning rates — tiny LR for frozen/early layers, larger for the new head.
5
Progressive unfreezing
After convergence, unfreeze more layers and continue with an even smaller LR. Repeat until fully unfrozen if needed.
Normalize inputs with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] — pre-trained models expect this exact distribution.
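The normalization step as a small sketch (framework preprocessing pipelines do the same thing; the function name here is my own):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """img: (H, W, 3) float array with values in [0, 1]."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
x = normalize(img)    # per-channel zero-centered, unit-ish scale
```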
Classic architectures
| Model | Key innovation | Params |
|---|---|---|
| AlexNet (2012) | First deep CNN to win ImageNet; introduced ReLU + dropout | 60M |
| VGGNet (2014) | Only 3×3 convs stacked deep — depth matters | 138M |
| ResNet (2015) | Skip connections — enables 100+ layer training | 25M (50-layer) |
| EfficientNet (2019) | Compound scaling of width + depth + resolution jointly | 5–66M |
| ViT (2020) | Pure transformer on image patches — no convolutions | 86M (Base) |
RNN / LSTM / GRU & Attention
Sequence models that maintain state across time steps — essential for text, speech, time series, and transliteration.
Vanilla RNN
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y
h_t is the "memory" — carries information from all previous steps.
Problem: gradients are multiplied by W_hh at EVERY step.
If eigenvalues of W_hh < 1 → gradients shrink exponentially → vanishing.
An RNN with 50 time steps must backprop through 50 matrix multiplications. With W_hh having eigenvalues < 1, the gradient at step 1 is W_hh^50 ≈ 0. The network forgets anything beyond ~10 steps.
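The shrinkage is easy to demonstrate numerically. A sketch using a symmetric matrix (built from random eigenvalues, so the spectral norm is known exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
lam = rng.uniform(-0.9, 0.9, size=8)     # all |eigenvalues| < 1
W_hh = Q @ np.diag(lam) @ Q.T            # symmetric recurrent weight matrix

# Backprop through 50 steps multiplies the gradient by W_hh fifty times
grad_factor = np.linalg.matrix_power(W_hh, 50)
# spectral norm ≤ 0.9^50 ≈ 5e-3 — the signal reaching step 1 has vanished
```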
LSTM — Long Short-Term Memory
Introduces a cell state C_t — a highway for gradients that flows with only element-wise operations (no matrix multiply). Three gates control information flow:
Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f) ← what to erase from C
Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i) ← what to write to C
Candidate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) ← what to write
Cell update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t ← update cell state
Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o) ← what to expose
Hidden state: h_t = o_t ⊙ tanh(C_t)
The cell state C_t flows through with only ⊙ (element-wise multiply) and + operations — no weight matrix in the gradient path. This is why gradients don't vanish.
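A single LSTM step, transcribing the six gate equations directly — a sketch with arbitrary dimensions; the `params` dict layout and small-weight initialization are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the gate equations above."""
    hx = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p['W_f'] @ hx + p['b_f'])           # forget gate
    i = sigmoid(p['W_i'] @ hx + p['b_i'])           # input gate
    C_tilde = np.tanh(p['W_C'] @ hx + p['b_C'])     # candidate values
    C = f * C_prev + i * C_tilde                    # cell update — element-wise only
    o = sigmoid(p['W_o'] @ hx + p['b_o'])           # output gate
    h = o * np.tanh(C)                              # hidden state
    return h, C

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
params = {f'W_{g}': 0.1 * rng.standard_normal((d_h, d_h + d_x)) for g in 'fiCo'}
params.update({f'b_{g}': np.zeros(d_h) for g in 'fiCo'})
h, C = lstm_step(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h), params)
```

Note that `C` is touched only by ⊙ and + — no weight matrix sits in its gradient path.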
GRU — Gated Recurrent Unit
Simplified LSTM: merges the forget and input gates into one update gate, and merges the cell state and hidden state. Fewer parameters, often matches LSTM performance.
Use LSTM when
Long sequences with complex dependencies. Large datasets where extra parameters help. Need separate cell and hidden state control.
Use GRU when
Smaller datasets. Faster training needed. Simpler dependencies. Usually competitive with LSTM at lower computational cost.
Seq2Seq for transliteration
Encoder: reads source chars x₁...xₙ → final hidden state h_n
Decoder: generates target chars y₁...yₘ using h_n
The entire source sequence is compressed into ONE fixed-size vector.
This is the bottleneck — works for short sequences, fails for long ones.
Attention mechanism
Instead of one fixed context vector, the decoder attends to ALL encoder hidden states at each decoding step:
score(h_s, h_t) = h_t^T · W_a · h_s ← alignment score (additive: v·tanh(W₁h_s + W₂h_t))
α_{ts} = softmax(score) over all s ← attention weights (sum to 1)
context_t = Σ_s α_{ts} · h_s ← weighted sum of encoder states
output_t = decoder(context_t, h_{t-1}, y_{t-1})
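One decoding step of dot-product attention (the simplest score function from the table below) as a sketch — assumes encoder and decoder hidden sizes match; shapes are illustrative:

```python
import numpy as np

def attention_step(h_t, encoder_states):
    """Dot-product attention for one decoder step."""
    scores = encoder_states @ h_t                  # alignment score per source position
    scores = scores - scores.max()                 # stability before softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    context = alpha @ encoder_states               # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 8))    # 6 source positions, hidden size 8
context, alpha = attention_step(rng.standard_normal(8), enc)
```

`context` would then be fed to the decoder together with h_{t-1} and y_{t-1}.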
For transliteration, attention learns near-monotonic alignment but handles 1-to-many character mappings (one Hindi character → 2 Latin chars). This directly improved the BLEU score.
Types of attention
| Type | Score function | Notes |
|---|---|---|
| Dot-product | h_s · h_t | Fast, requires same dimension |
| Scaled dot-product | (h_s · h_t) / √d_k | Used in Transformers — prevents softmax saturation |
| Additive (Bahdanau) | v^T tanh(W₁h_s + W₂h_t) | Original attention paper, learned alignment |
| Multi-head | Concat h heads of scaled dot | Transformers — attends to multiple positions simultaneously |
Naive Bayes
A probabilistic classifier based on Bayes' theorem with one bold assumption: all features are conditionally independent given the class label.
Bayes' theorem
P(y | x) = P(x | y) · P(y) / P(x)
Posterior = Likelihood × Prior / Evidence
For classification we compare classes, so P(x) is constant — we ignore it:
argmax_y P(y | x) = argmax_y P(x | y) · P(y)
The naive assumption
P(x₁, x₂, ..., xₙ | y) = P(x₁|y) · P(x₂|y) · ... · P(xₙ|y). Features are assumed independent given the class. Almost never true in reality — but works surprisingly well, especially for text classification.
Training — what we estimate from data
Prior: P(y=c) = count(y=c) / N
Likelihood: P(xᵢ=v|y=c) = count(xᵢ=v in class c) / count(class c)
For continuous features → Gaussian NB:
P(xᵢ|y=c) = (1/√(2πσ²_ic)) · exp(-(xᵢ-μ_ic)² / 2σ²_ic)
Estimate μ_ic and σ²_ic per feature per class from training data.
Laplace smoothing — handle unseen values
Problem: if a word never appeared in class c during training, P(word|c) = 0.
Then the entire product P(x|c) = 0 — one unseen word kills the prediction.
P(xᵢ=v | y=c) = (count(xᵢ=v, y=c) + α) / (count(y=c) + α · |V|)
α = 1 → Laplace smoothing. α > 0 → no zero probabilities ever.
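The smoothed estimate in NumPy — a sketch for one class's word counts; the counts are made up to include unseen words:

```python
import numpy as np

def smoothed_likelihoods(counts, alpha=1.0):
    """counts: word counts for one class, shape (|V|,). Additive (Laplace) smoothing."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = np.array([3, 0, 7, 0, 5])   # two words never seen in this class
probs = smoothed_likelihoods(counts)
# no zeros, and the estimates still form a valid distribution
```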
Log probabilities — avoid underflow
Problem: multiplying 1000 probabilities like 0.001 × 0.003 × ... underflows to 0 in float32
log P(y|x) ∝ log P(y) + Σᵢ log P(xᵢ|y)
Products become sums of logs — numerically stable, same argmax result.
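A two-line demonstration of the underflow and the log-space fix (the probability values are illustrative):

```python
import numpy as np

p = np.full(1000, 1e-3)               # 1000 small per-feature likelihoods

prod = np.prod(p.astype(np.float32))  # direct product underflows to exactly 0
log_sum = np.sum(np.log(p))           # sum of logs stays finite: 1000·ln(1e-3)
```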
Implementation from scratch
import numpy as np

class NaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.vars[c] = X_c.var(axis=0)

    def _log_likelihood(self, x, mean, var):
        # Gaussian log-density per feature; 1e-9 guards against zero variance
        log_exp = -((x - mean) ** 2) / (2 * var + 1e-9)
        log_norm = -0.5 * np.log(2 * np.pi * var + 1e-9)
        return np.sum(log_norm + log_exp)

    def predict(self, X):
        preds = []
        for x in X:
            scores = {c: np.log(self.priors[c]) +
                         self._log_likelihood(x, self.means[c], self.vars[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds
Variants
Variant
Feature type
Likelihood model
Best use case
Gaussian NB
Continuous
Gaussian distribution
Real-valued features (sepal length, etc.)
Multinomial NB
Counts / frequencies
Multinomial distribution
Text classification (word counts, TF-IDF)
Bernoulli NB
Binary (present/absent)
Bernoulli distribution
Short docs, binary bag-of-words
Complement NB
Counts
Complement class stats
Imbalanced text datasets
AdaBoost
Adaptive Boosting: train a sequence of weak learners where each one focuses on the mistakes of its predecessors, then combine with weighted vote.
Core intuition
A "weak learner" only needs to be slightly better than random (>50% accuracy). AdaBoost amplifies the weights of misclassified samples so each new learner is forced to focus on the hard cases.
Algorithm — full walkthrough
Initialize: wᵢ = 1/N for all i = 1...N
for t = 1 to T:
a) Train weak learner h_t on weighted training data
b) Weighted error: εₜ = Σᵢ wᵢ · 𝟙[h_t(xᵢ) ≠ yᵢ]
c) Learner weight: αₜ = ½ · ln((1 - εₜ) / εₜ) ← large if ε→0, zero if ε=0.5
d) Update weights: wᵢ ← wᵢ · exp(-αₜ · yᵢ · h_t(xᵢ))
   ↑ grows for misclassified, shrinks for correct
e) Normalize: wᵢ ← wᵢ / Σⱼ wⱼ
Final: H(x) = sign(Σₜ αₜ · h_t(x)) ← weighted majority vote
Understanding the learner weight α
When ε → 0 (near-perfect learner): α → ∞ — huge weight, dominates the ensemble. When ε = 0.5 (random guess): α = 0 — completely ignored. When ε > 0.5: α < 0 — flip its predictions!
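A quick numeric check of the three regimes:

```python
import numpy as np

def learner_weight(eps):
    """αₜ = ½ · ln((1 - εₜ) / εₜ)"""
    return 0.5 * np.log((1 - eps) / eps)

assert learner_weight(0.5) == 0.0    # random guesser → ignored
assert learner_weight(0.01) > 2      # strong learner → large weight
assert learner_weight(0.9) < 0       # worse than random → predictions flipped
```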
Weight update explained
Correctly classified (yᵢ·h_t(xᵢ) = +1)
exp(-αₜ) < 1 → weight shrinks. This sample gets less attention in the next round.
Misclassified (yᵢ·h_t(xᵢ) = -1)
exp(+αₜ) > 1 → weight grows. Next learner is forced to focus here.
Implementation from scratch
import numpy as np

class DecisionStump:
    def fit(self, X, y, weights):
        best = {'err': float('inf')}
        for feat in range(X.shape[1]):
            for thresh in np.unique(X[:, feat]):
                for polarity in [1, -1]:
                    preds = polarity * np.sign(X[:, feat] - thresh)
                    err = np.sum(weights[preds != y])
                    if err < best['err']:
                        best = {'err': err, 'feat': feat,
                                'thresh': thresh, 'pol': polarity}
        self.__dict__.update(best)

    def predict(self, X):
        return self.pol * np.sign(X[:, self.feat] - self.thresh)

class AdaBoost:
    def fit(self, X, y, T=50):
        n = len(y)
        w = np.ones(n) / n
        self.stumps, self.alphas = [], []
        for _ in range(T):
            stump = DecisionStump()
            stump.fit(X, y, w)
            preds = stump.predict(X)
            eps = np.sum(w[preds != y]) + 1e-10
            alpha = 0.5 * np.log((1 - eps) / eps)
            w *= np.exp(-alpha * y * preds)
            w /= w.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)

    def predict(self, X):
        return np.sign(sum(a * s.predict(X)
                           for a, s in zip(self.alphas, self.stumps)))
Properties
Training error
Decreases exponentially with T. Provably converges to zero training error if each weak learner beats random chance.
Sensitivity
Resistant to overfitting in practice. But very sensitive to noisy labels and outliers — they get amplified exponentially.
Loss Functions
The loss defines what the model optimizes for. Choosing the wrong one can cripple training even with a perfect architecture.
Interactive comparison — regression losses
2.0
4.000
MSE
2.000
MAE
1.500
Huber (δ=1)
Implementation
import numpy as np
# Regression losses
def mse(y, yhat): return np.mean((y - yhat) ** 2)
def mae(y, yhat): return np.mean(np.abs(y - yhat))
def huber(y, yhat, d=1):
    e = y - yhat
    return np.where(np.abs(e) <= d,
                    0.5 * e ** 2,
                    d * (np.abs(e) - 0.5 * d)).mean()
# Classification losses
def bce(y, p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, logits):
    p = softmax(logits)
    return -np.sum(y_onehot * np.log(p + 1e-9))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def hinge(y, s):  # y in {-1, +1}
    return np.mean(np.maximum(0, 1 - y * s))

def kl_div(p, q):
    return np.sum(p * np.log(p / (q + 1e-9)))