How a neural network learns — computing gradients via the chain rule and propagating them backward through every layer.
The big picture
Training has two phases. The forward pass computes a prediction. The backward pass figures out how much each weight was responsible for the error, then nudges every weight in the direction that reduces it.
The key insight: the chain rule from calculus lets us break a complex derivative into a product of simpler ones — each layer only needs to know its own local gradient.
Forward pass — layer by layer
Each layer does the same two operations: a linear transform, then a non-linearity.
z = W·x + b          ← pre-activation (linear)
a = σ(z)             ← post-activation (non-linear)
L = loss(a_final, y) ← scalar error at the end
Without non-linearities: W3(W2(W1·x)) = W_combined·x — the entire deep network collapses to a single linear model. Depth becomes meaningless.
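A quick NumPy check of the collapse — a minimal sketch with random matrices (the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((3, 4))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(5)

deep = W3 @ (W2 @ (W1 @ x))       # three stacked linear "layers"
combined = (W3 @ W2 @ W1) @ x     # one equivalent single linear layer
assert np.allclose(deep, combined)
```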
Backward pass — the chain rule
To update a weight W¹ in the first layer, we need ∂L/∂W¹. The chain rule lets us factor this:
∂L/∂W¹ = ∂L/∂a³ · ∂a³/∂z³ · ∂z³/∂a² · ∂a²/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹
Computed RIGHT to LEFT — from loss back toward inputs.
Each layer stores its δ (local error signal) and passes it back.
δ_output = ∂L/∂z = (a - y) ← for MSE loss
δ_hidden = (W_next^T · δ_next) ⊙ σ'(z) ← propagate backward
∂L/∂W = δ · a_prev^T
∂L/∂b = δ
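The four δ equations above can be sketched for a tiny two-layer network — sigmoid hidden layer, linear output, MSE loss ½(a−y)², so δ_output = (a − y) holds exactly. Shapes and variable names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))          # one input sample
y = np.array([[1.0]])                    # target
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2                        # linear output layer

# Backward pass — exactly the δ equations above
delta2 = z2 - y                          # δ_output = (a - y) for MSE
delta1 = (W2.T @ delta2) * a1 * (1 - a1) # δ_hidden = (W_nextᵀ·δ_next) ⊙ σ'(z)
dW2, db2 = delta2 @ a1.T, delta2         # ∂L/∂W = δ·a_prevᵀ,  ∂L/∂b = δ
dW1, db1 = delta1 @ x.T, delta1
```

A finite-difference check on any single weight confirms these analytic gradients.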
Vanishing & exploding gradients
Vanishing
Sigmoid/tanh saturate → derivatives near 0 → gradients shrink exponentially through layers → early layers learn nothing.
Exploding
Derivatives/weights > 1 → gradients grow exponentially through layers → unstable updates. Common fix: gradient clipping.
Dropout
During training, randomly zero out each neuron with probability p. Forces the network to learn redundant representations — no single neuron gets relied upon.
Train time (inverted dropout):
mask = Bernoulli(1-p)
a_drop = a ⊙ mask / (1-p) ← scale up so expected value stays the same
Test time:
a_out = a ← use all neurons as-is, no scaling needed
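Inverted dropout in NumPy — a minimal sketch; the array sizes and drop probability are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(10_000)    # activations of one layer
p = 0.5                            # drop probability

# Train time: zero with probability p, scale survivors by 1/(1-p)
mask = rng.random(a.shape) >= p    # Bernoulli(1-p) keep-mask
a_drop = a * mask / (1 - p)

# Test time: use `a` unchanged — the 1/(1-p) scaling already
# made E[a_drop] match E[a], so no correction is needed.
```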
Optimization Algorithms
Six algorithms that solve the same problem — update weights to minimize loss — but differ in how they use gradient history.
Side-by-side comparison
| Optimizer | Extra memory | Adaptive LR | Best for | Pitfall |
|---|---|---|---|---|
| SGD | None | No | Vision with tuning | Slow, oscillates in ravines |
| Momentum | v | No | Convex + noisy loss | Can overshoot |
| NAG | v | No | Slightly better than momentum | More complex to implement |
| Adagrad | G (sum sq.) | Yes | Sparse NLP features | LR decays to zero |
| RMSProp | v² | Yes | RNNs, non-stationary | No bias correction |
| Adam | m, v | Yes | Default choice, NLP | May generalize worse than SGD |
Gradient descent variants
Batch GD
Full dataset per update. Stable gradient estimate but extremely slow. Not practical for large datasets.
SGD
One sample per update. Noisy but fast. Noise helps escape local minima. Still the term used for mini-batch in practice.
Mini-batch
Best of both worlds — batch size 32–256. GPU-parallelizable. Standard in practice.
Learning rate scheduling
Step decay: lr = lr₀ × γ^⌊epoch/step_size⌋
Cosine decay: lr = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))
Warmup: lr increases linearly for k steps, then decays
OneCycleLR: warmup → peak → cosine decay (fast convergence)
Warmup is critical when fine-tuning transformers — a large learning rate at the start will destroy pre-trained weights.
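The schedule formulas above can be sketched as plain functions — a minimal illustration; the parameter names (`lr0`, `gamma`, `warmup`, …) are my own choices, not a specific library's API:

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """lr = lr0 · γ^⌊epoch/step_size⌋"""
    return lr0 * gamma ** (epoch // step_size)

def cosine_decay(lr_min, lr_max, t, T):
    """lr = lr_min + ½(lr_max - lr_min)(1 + cos(πt/T))"""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(lr_max, t, warmup, T):
    """Linear warmup for `warmup` steps, then cosine decay to 0."""
    if t < warmup:
        return lr_max * (t + 1) / warmup
    return cosine_decay(0.0, lr_max, t - warmup, T - warmup)
```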
Convolutional Neural Networks
CNNs exploit the spatial structure of images using weight sharing, local connectivity, and hierarchical feature learning.
Why convolutions?
Weight sharing
The same 3×3 filter slides across the entire image. An edge detector learned in the top-left corner works everywhere. Vastly fewer parameters than a dense layer.
Output size = ⌊(W - F + 2P) / S⌋ + 1
W = input size, F = filter size, P = padding, S = stride
Parameters = F × F × C_in × C_out + C_out (biases)
Example: 3×3 conv, 64 → 128 channels
= 3×3×64×128 + 128 = 73,856 params
Dense equivalent: a fully connected layer between the same flattened feature maps would need (64·H·W) × (128·H·W) weights — millions of parameters.
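Both formulas as one-liners, checked against the worked example above:

```python
def conv_output_size(W, F, P=0, S=1):
    """⌊(W - F + 2P) / S⌋ + 1"""
    return (W - F + 2 * P) // S + 1

def conv_params(F, C_in, C_out):
    """F×F weights per in/out channel pair, plus one bias per output channel."""
    return F * F * C_in * C_out + C_out

# Worked example from above: 3×3 conv, 64 → 128 channels
assert conv_params(3, 64, 128) == 73_856
# 224×224 input, 3×3 filter, padding 1, stride 1 → spatial size preserved
assert conv_output_size(224, 3, P=1, S=1) == 224
```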
Batch normalization
x_norm = (x - μ_batch) / √(σ²_batch + ε)
y = γ · x_norm + β ← learned scale and shift per channel
Benefits: faster training, acts as regularizer, allows higher learning rates, reduces sensitivity to initialization. Applied before or after activation (before is more common now).
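A forward-pass sketch of the two equations above for a (N, C) batch — training-mode statistics only; running averages for inference are omitted, and the input distribution is an arbitrary example:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, C) — normalize each channel over the batch dimension."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta          # learned scale and shift per channel

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(256, 4))   # badly shifted/scaled activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
# out has per-channel mean ≈ 0 and std ≈ 1
```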
Feature visualization techniques
| Technique | What it shows | How |
|---|---|---|
| Filter viz | What pattern a filter detects | Display learned conv1 weight kernels directly |
| Activation maps | Where a filter fires on an input | Forward pass → visualize intermediate feature map |
| Grad-CAM | Where model looked for its decision | Gradients into last conv layer → weighted spatial map |
| Activation max | What input maximally excites a neuron | Backprop into input pixels, gradient ascent |
| SHAP | Per-pixel attribution | Perturb patches, measure output change |
Grad-CAM explained
1. Forward pass — store last conv layer activations A^k (H×W×K)
2. Backprop — compute gradients ∂y_c/∂A^k for class c
3. Global average pool the gradients:
   α_k = (1/HW) Σᵢⱼ ∂y_c/∂A^k_ij ← importance of channel k
4. Weighted combination + ReLU:
   L_CAM = ReLU(Σ_k α_k · A^k) ← heatmap (H×W)
5. Upsample heatmap to input resolution and overlay
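Steps 3–4 can be sketched in NumPy, with random arrays standing in for real activations and gradients (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 16, 7, 7
A = rng.random((K, H, W))               # step 1: last conv activations A^k
grads = rng.standard_normal((K, H, W))  # step 2: ∂y_c/∂A^k for class c

alpha = grads.mean(axis=(1, 2))         # step 3: global average pool → (K,)
cam = np.maximum(0, np.tensordot(alpha, A, axes=1))  # step 4: ReLU(Σ α_k·A^k) → (H, W)
```

The resulting (H, W) map would then be upsampled to the input resolution (step 5).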
Fine-tuning pre-trained models
1
Load pre-trained model
Load ResNet/EfficientNet/ViT pre-trained on ImageNet. It has learned rich features: edges → textures → object parts → objects across 1.2M images.
2
Replace the classification head
Remove the final FC layer (1000 ImageNet classes). Add a new FC layer for your number of classes, randomly initialized.
3
Freeze early layers
Freeze layers 1–N. Only the new head is trainable. Early layers detect generic features that transfer — no need to relearn them.
4
Train with small LR
Use LR ≈ 1e-4 or smaller. Use discriminative learning rates — tiny LR for frozen/early layers, larger for the new head.
5
Progressive unfreezing
After convergence, unfreeze more layers and continue with an even smaller LR. Repeat until fully unfrozen if needed.
Normalize inputs with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] — pre-trained models expect this exact distribution.
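The normalization step as a small sketch (framework preprocessing pipelines do the same thing; the function name here is my own):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """img: (H, W, 3) float array with values in [0, 1]."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
x = normalize(img)    # per-channel zero-centered, unit-ish scale
```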
Classic architectures
| Model | Key innovation | Params |
|---|---|---|
| AlexNet (2012) | First deep CNN to win ImageNet; introduced ReLU + dropout | 60M |
| VGGNet (2014) | Only 3×3 convs stacked deep — depth matters | 138M |
| ResNet (2015) | Skip connections — enables 100+ layer training | 25M (50-layer) |
| EfficientNet (2019) | Compound scaling of width + depth + resolution jointly | 5–66M |
| ViT (2020) | Pure transformer on image patches — no convolutions | 86M (Base) |
RNN / LSTM / GRU & Attention
Sequence models that maintain state across time steps — essential for text, speech, time series, and transliteration.
Vanilla RNN
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y
h_t is the "memory" — carries information from all previous steps.
Problem: gradients are multiplied by W_hh at EVERY step.
If eigenvalues of W_hh < 1 → gradients shrink exponentially → vanishing.
An RNN with 50 time steps must backprop through 50 matrix multiplications. With W_hh having eigenvalues < 1, the gradient at step 1 is W_hh^50 ≈ 0. The network forgets anything beyond ~10 steps.
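The shrinkage is easy to demonstrate numerically. A sketch using a symmetric matrix (built from random eigenvalues, so the spectral norm is known exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
lam = rng.uniform(-0.9, 0.9, size=8)     # all |eigenvalues| < 1
W_hh = Q @ np.diag(lam) @ Q.T            # symmetric recurrent weight matrix

# Backprop through 50 steps multiplies the gradient by W_hh fifty times
grad_factor = np.linalg.matrix_power(W_hh, 50)
# spectral norm ≤ 0.9^50 ≈ 5e-3 — the signal reaching step 1 has vanished
```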
LSTM — Long Short-Term Memory
Introduces a cell state C_t — a highway for gradients that flows with only element-wise operations (no matrix multiply). Three gates control information flow:
Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f) ← what to erase from C
Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i) ← what to write to C
Candidate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) ← what to write
Cell update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t ← update cell state
Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o) ← what to expose
Hidden state: h_t = o_t ⊙ tanh(C_t)
The cell state C_t flows through with only ⊙ (element-wise multiply) and + operations — no weight matrix in the gradient path. This is why gradients don't vanish.
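A single LSTM step, transcribing the six gate equations directly — a sketch with arbitrary dimensions; the `params` dict layout and small-weight initialization are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the gate equations above."""
    hx = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p['W_f'] @ hx + p['b_f'])           # forget gate
    i = sigmoid(p['W_i'] @ hx + p['b_i'])           # input gate
    C_tilde = np.tanh(p['W_C'] @ hx + p['b_C'])     # candidate values
    C = f * C_prev + i * C_tilde                    # cell update — element-wise only
    o = sigmoid(p['W_o'] @ hx + p['b_o'])           # output gate
    h = o * np.tanh(C)                              # hidden state
    return h, C

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
params = {f'W_{g}': 0.1 * rng.standard_normal((d_h, d_h + d_x)) for g in 'fiCo'}
params.update({f'b_{g}': np.zeros(d_h) for g in 'fiCo'})
h, C = lstm_step(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h), params)
```

Note that `C` is touched only by ⊙ and + — no weight matrix sits in its gradient path.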
GRU — Gated Recurrent Unit
Simplified LSTM: merges the forget and input gates into one update gate, and merges the cell state and hidden state. Fewer parameters, often matches LSTM performance.
Use LSTM when
Long sequences with complex dependencies. Large datasets where extra parameters help. Need separate cell and hidden state control.
Use GRU when
Smaller datasets. Faster training needed. Simpler dependencies. Usually competitive with LSTM at lower computational cost.
Seq2Seq for transliteration
Encoder: reads source chars x₁...xₙ → final hidden state h_n
Decoder: generates target chars y₁...yₘ using h_n
The entire source sequence is compressed into ONE fixed-size vector.
This is the bottleneck — works for short sequences, fails for long ones.
Attention mechanism
Instead of one fixed context vector, the decoder attends to ALL encoder hidden states at each decoding step:
score(h_s, h_t) = h_t^T · W_a · h_s ← alignment score (additive: v·tanh(W₁h_s + W₂h_t))
α_{ts} = softmax(score) over all s ← attention weights (sum to 1)
context_t = Σ_s α_{ts} · h_s ← weighted sum of encoder states
output_t = decoder(context_t, h_{t-1}, y_{t-1})
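One decoding step of dot-product attention (the simplest score function from the table below) as a sketch — assumes encoder and decoder hidden sizes match; shapes are illustrative:

```python
import numpy as np

def attention_step(h_t, encoder_states):
    """Dot-product attention for one decoder step."""
    scores = encoder_states @ h_t                  # alignment score per source position
    scores = scores - scores.max()                 # stability before softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    context = alpha @ encoder_states               # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 8))    # 6 source positions, hidden size 8
context, alpha = attention_step(rng.standard_normal(8), enc)
```

`context` would then be fed to the decoder together with h_{t-1} and y_{t-1}.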
For transliteration, attention learns near-monotonic alignment but handles 1-to-many character mappings (one Hindi character → 2 Latin chars). This directly improved the BLEU score.
Types of attention
| Type | Score function | Notes |
|---|---|---|
| Dot-product | h_s · h_t | Fast, requires same dimension |
| Scaled dot-product | (h_s · h_t) / √d_k | Used in Transformers — prevents softmax saturation |
| Additive (Bahdanau) | v^T tanh(W₁h_s + W₂h_t) | Original attention paper, learned alignment |
| Multi-head | Concat h heads of scaled dot | Transformers — attends to multiple positions simultaneously |
Naive Bayes
A probabilistic classifier based on Bayes' theorem with one bold assumption: all features are conditionally independent given the class label.
Bayes' theorem
P(y | x) = P(x | y) · P(y) / P(x)
Posterior = Likelihood × Prior / Evidence
For classification we compare classes, so P(x) is constant — we ignore it:
argmax_y P(y | x) = argmax_y P(x | y) · P(y)
The naive assumption
P(x₁, x₂, ..., xₙ | y) = P(x₁|y) · P(x₂|y) · ... · P(xₙ|y). Features are assumed independent given the class. Almost never true in reality — but works surprisingly well, especially for text classification.
Training — what we estimate from data
Prior: P(y=c) = count(y=c) / N
Likelihood: P(xᵢ=v|y=c) = count(xᵢ=v in class c) / count(class c)
For continuous features → Gaussian NB:
P(xᵢ|y=c) = (1/√(2πσ²_ic)) · exp(-(xᵢ-μ_ic)² / 2σ²_ic)
Estimate μ_ic and σ²_ic per feature per class from training data.
Laplace smoothing — handle unseen values
Problem: if a word never appeared in class c during training, P(word|c) = 0.
Then the entire product P(x|c) = 0 — one unseen word kills the prediction.
P(xᵢ=v | y=c) = (count(xᵢ=v, y=c) + α) / (count(y=c) + α · |V|)
α = 1 → Laplace smoothing. α > 0 → no zero probabilities ever.
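The smoothed estimate in NumPy — a sketch for one class's word counts; the counts are made up to include unseen words:

```python
import numpy as np

def smoothed_likelihoods(counts, alpha=1.0):
    """counts: word counts for one class, shape (|V|,). Additive (Laplace) smoothing."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = np.array([3, 0, 7, 0, 5])   # two words never seen in this class
probs = smoothed_likelihoods(counts)
# no zeros, and the estimates still form a valid distribution
```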
Log probabilities — avoid underflow
Problem: multiplying 1000 probabilities like 0.001 × 0.003 × ... underflows to 0 in float32
log P(y|x) ∝ log P(y) + Σᵢ log P(xᵢ|y)
Products become sums of logs — numerically stable, same argmax result.
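A two-line demonstration of the underflow and the log-space fix (the probability values are illustrative):

```python
import numpy as np

p = np.full(1000, 1e-3)               # 1000 small per-feature likelihoods

prod = np.prod(p.astype(np.float32))  # direct product underflows to exactly 0
log_sum = np.sum(np.log(p))           # sum of logs stays finite: 1000·ln(1e-3)
```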
Implementation from scratch
import numpy as np

class NaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.vars[c] = X_c.var(axis=0)

    def _log_likelihood(self, x, mean, var):
        # Gaussian log-density per feature; 1e-9 guards against zero variance
        log_exp = -((x - mean) ** 2) / (2 * var + 1e-9)
        log_norm = -0.5 * np.log(2 * np.pi * var + 1e-9)
        return np.sum(log_norm + log_exp)

    def predict(self, X):
        preds = []
        for x in X:
            scores = {c: np.log(self.priors[c]) +
                         self._log_likelihood(x, self.means[c], self.vars[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds
Variants
Variant
Feature type
Likelihood model
Best use case
Gaussian NB
Continuous
Gaussian distribution
Real-valued features (sepal length, etc.)
Multinomial NB
Counts / frequencies
Multinomial distribution
Text classification (word counts, TF-IDF)
Bernoulli NB
Binary (present/absent)
Bernoulli distribution
Short docs, binary bag-of-words
Complement NB
Counts
Complement class stats
Imbalanced text datasets
AdaBoost
Adaptive Boosting: train a sequence of weak learners where each one focuses on the mistakes of its predecessors, then combine with weighted vote.
Core intuition
A "weak learner" only needs to be slightly better than random (>50% accuracy). AdaBoost amplifies the weights of misclassified samples so each new learner is forced to focus on the hard cases.
Algorithm — full walkthrough
Initialize: wᵢ = 1/N for all i = 1...N
for t = 1 to T:
a) Train weak learner h_t on weighted training data
b) Weighted error: εₜ = Σᵢ wᵢ · 𝟙[h_t(xᵢ) ≠ yᵢ]
c) Learner weight: αₜ = ½ · ln((1 - εₜ) / εₜ) ← large if ε→0, zero if ε=0.5
d) Update weights: wᵢ ← wᵢ · exp(-αₜ · yᵢ · h_t(xᵢ))
   ↑ grows for misclassified, shrinks for correct
e) Normalize: wᵢ ← wᵢ / Σⱼ wⱼ
Final: H(x) = sign(Σₜ αₜ · h_t(x)) ← weighted majority vote
Understanding the learner weight α
When ε → 0 (near-perfect learner): α → ∞ — huge weight, dominates the ensemble. When ε = 0.5 (random guess): α = 0 — completely ignored. When ε > 0.5: α < 0 — flip its predictions!
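A quick numeric check of the three regimes:

```python
import numpy as np

def learner_weight(eps):
    """αₜ = ½ · ln((1 - εₜ) / εₜ)"""
    return 0.5 * np.log((1 - eps) / eps)

assert learner_weight(0.5) == 0.0    # random guesser → ignored
assert learner_weight(0.01) > 2      # strong learner → large weight
assert learner_weight(0.9) < 0       # worse than random → predictions flipped
```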
Weight update explained
Correctly classified (yᵢ·h_t(xᵢ) = +1)
exp(-αₜ) < 1 → weight shrinks. This sample gets less attention in the next round.
Misclassified (yᵢ·h_t(xᵢ) = -1)
exp(+αₜ) > 1 → weight grows. Next learner is forced to focus here.
Implementation from scratch
import numpy as np

class DecisionStump:
    def fit(self, X, y, weights):
        best = {'err': float('inf')}
        for feat in range(X.shape[1]):
            for thresh in np.unique(X[:, feat]):
                for polarity in [1, -1]:
                    preds = polarity * np.sign(X[:, feat] - thresh)
                    err = np.sum(weights[preds != y])
                    if err < best['err']:
                        best = {'err': err, 'feat': feat,
                                'thresh': thresh, 'pol': polarity}
        self.__dict__.update(best)

    def predict(self, X):
        return self.pol * np.sign(X[:, self.feat] - self.thresh)

class AdaBoost:
    def fit(self, X, y, T=50):
        n = len(y)
        w = np.ones(n) / n
        self.stumps, self.alphas = [], []
        for _ in range(T):
            stump = DecisionStump()
            stump.fit(X, y, w)
            preds = stump.predict(X)
            eps = np.sum(w[preds != y]) + 1e-10
            alpha = 0.5 * np.log((1 - eps) / eps)
            w *= np.exp(-alpha * y * preds)
            w /= w.sum()
            self.stumps.append(stump)
            self.alphas.append(alpha)

    def predict(self, X):
        return np.sign(sum(a * s.predict(X)
                           for a, s in zip(self.alphas, self.stumps)))
Properties
Training error
Decreases exponentially with T. Provably converges to zero training error if each weak learner beats random chance.
Sensitivity
Resistant to overfitting in practice. But very sensitive to noisy labels and outliers — they get amplified exponentially.
Loss Functions
The loss defines what the model optimizes for. Choosing the wrong one can cripple training even with a perfect architecture.
Interactive comparison — regression losses
2.0
4.000
MSE
2.000
MAE
1.500
Huber (δ=1)
Implementation
import numpy as np
# Regression losses
def mse(y, yhat): return np.mean((y - yhat) ** 2)
def mae(y, yhat): return np.mean(np.abs(y - yhat))
def huber(y, yhat, d=1):
    e = y - yhat
    return np.where(np.abs(e) <= d,
                    0.5 * e ** 2,
                    d * (np.abs(e) - 0.5 * d)).mean()
# Classification losses
def bce(y, p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, logits):
    p = softmax(logits)
    return -np.sum(y_onehot * np.log(p + 1e-9))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def hinge(y, s):  # y in {-1, +1}
    return np.mean(np.maximum(0, 1 - y * s))

def kl_div(p, q):
    return np.sum(p * np.log(p / (q + 1e-9)))