Information Theory for ML
Entropy quantifies uncertainty, cross-entropy measures prediction quality, and KL divergence measures how far one distribution is from another (it is not a true distance, because it is asymmetric).
Almost every classification model you've trained probably used cross-entropy loss. But why cross-entropy? Why not mean squared error for classification? The answer comes from information theory, a field that quantifies uncertainty, surprise, and the cost of being wrong. Once you understand it, loss function selection becomes a principled decision, not a recipe to memorize.
Learning Objectives
- ○Compute entropy as a measure of uncertainty in a distribution
- ○Understand cross-entropy loss as measuring the information gap between predictions and truth
- ○Calculate KL divergence and understand its asymmetry
- ○Connect entropy to compression ratios — the bridge to frontend intuition
- ○Choose appropriate loss functions based on information-theoretic principles
Entropy: Measuring Surprise
You've used gzip on web assets. Files with repetitive content compress well (low entropy); random data barely compresses at all (high entropy). Entropy is the theoretical lower bound on the average number of bits per symbol needed to encode messages from a source.
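To make the compression connection concrete, the sketch below estimates the entropy of a byte buffer from its byte frequencies. The function name `byteEntropy` and the sample buffers are illustrative assumptions, not part of the lesson's code. This is an order-0 estimate: it looks only at how often each byte value occurs and ignores patterns across bytes, so a real compressor such as gzip can beat it on repetitive input.

```typescript
// Order-0 empirical entropy of a byte buffer, in bits per byte.
// Hypothetical helper for illustration only.
function byteEntropy(data: Uint8Array): number {
  if (data.length === 0) return 0;

  // Count how often each of the 256 possible byte values occurs.
  const counts = new Array<number>(256).fill(0);
  for (let i = 0; i < data.length; i++) {
    counts[data[i]]++;
  }

  // H = -sum(p * log2(p)) over byte values that actually occur.
  let h = 0;
  for (let v = 0; v < 256; v++) {
    if (counts[v] === 0) continue;
    const p = counts[v] / data.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Sixteen copies of the byte for 'a': only one symbol ever appears.
const repetitive = new Uint8Array(16).fill(97);
// 'abcd' repeated four times: four equally frequent symbols.
const mixed = Uint8Array.from({ length: 16 }, (_, i) => 97 + (i % 4));

console.log('repetitive:', byteEntropy(repetitive)); // 0 bits per byte
console.log('mixed:', byteEntropy(mixed)); // 2 bits per byte
```

A buffer of uniformly random bytes would approach 8 bits per byte under this estimate, which is why it has no room left to compress.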
Frontend
Gzip Compression
const ratio = compressed.length / original.length
Machine Learning
Entropy
const H = -probs.reduce((s, p) => s + p * Math.log2(p), 0)
// Entropy: H(X) = -sum(p(x) * log2(p(x)))
// It measures the average surprise of events from a distribution
function entropy(probs: number[]): number {
  return -probs.reduce((sum, p) => {
    if (p === 0) return sum; // 0 * log(0) = 0 by convention
    return sum + p * Math.log2(p);
  }, 0);
}
// Fair coin: maximum uncertainty
console.log('Fair coin:', entropy([0.5, 0.5]).toFixed(4));
// 1.0000 bit — you need exactly 1 bit (0 or 1) to encode each flip
// Biased coin (90% heads)
console.log('Biased coin:', entropy([0.9, 0.1]).toFixed(4));
// 0.4690 bits — less surprise, more compressible
// Certain outcome (100% heads)
console.log('Certain:', entropy([1.0, 0.0]).toFixed(4));
// 0.0000 bits — no surprise at all, nothing to encode
// Uniform distribution over 8 classes
const uniform8 = Array(8).fill(1/8);
console.log('Uniform 8-class:', entropy(uniform8).toFixed(4));
// 3.0000 bits — need 3 bits to encode 8 equally likely outcomes
// Connection to gzip:
// High-entropy file (random bytes) -> poor compression ratio
// Low-entropy file (repeated patterns) -> great compression ratio
// Entropy IS the compression limit
Cross-Entropy: Measuring Prediction Quality
Cross-entropy H(p, q) asks: "If the true distribution is p, but I'm using a code built for distribution q, how many bits do I need per message on average?" When q matches p perfectly, cross-entropy equals entropy (no waste). When q is wrong, you pay extra bits on top of the entropy.
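The claim that a matching q costs exactly the entropy is easiest to see with a true distribution that is not one-hot. The following is a minimal sketch; the helper `crossEntropyBits` and the example distributions `truth` and `skewed` are assumptions chosen so the arithmetic comes out in round numbers.

```typescript
// Cross-entropy H(p, q) in bits. Terms where p is 0 contribute nothing.
// Hypothetical helper for illustration only.
function crossEntropyBits(p: number[], q: number[]): number {
  let total = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] === 0) continue;
    total -= p[i] * Math.log2(q[i]);
  }
  return total;
}

// A "soft" true distribution, so its entropy is greater than zero.
const truth = [0.5, 0.25, 0.25];
// A model that puts most of its mass on the wrong outcome.
const skewed = [0.25, 0.25, 0.5];

// Encoding with the true distribution: H(p, p) is just the entropy H(p).
const matched = crossEntropyBits(truth, truth);
// Encoding with the wrong distribution costs more bits on average.
const mismatched = crossEntropyBits(truth, skewed);
// The difference is the wasted bits.
const extra = mismatched - matched;

console.log('H(p, p):', matched.toFixed(4)); // 1.5000
console.log('H(p, q):', mismatched.toFixed(4)); // 1.7500
console.log('extra bits:', extra.toFixed(4)); // 0.2500
```

With one-hot labels the entropy of the true distribution is 0, so the whole cross-entropy value is wasted bits.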
import * as tf from '@tensorflow/tfjs';
function crossEntropy(trueProbs: number[], predictedProbs: number[]): number {
  return -trueProbs.reduce((sum, p, i) => {
    if (p === 0) return sum;
    return sum + p * Math.log2(predictedProbs[i]);
  }, 0);
}
// True distribution: cat with 100% certainty
const trueLabel = [1.0, 0.0, 0.0]; // [cat, dog, bird]
// Good prediction
const goodPred = [0.9, 0.05, 0.05];
console.log('Good prediction CE:', crossEntropy(trueLabel, goodPred).toFixed(4));
// 0.1520 bits — low cost, model is close
// Bad prediction
const badPred = [0.1, 0.6, 0.3];
console.log('Bad prediction CE:', crossEntropy(trueLabel, badPred).toFixed(4));
// 3.3219 bits — high cost, model is very wrong
// Terrible prediction (almost certain it's NOT a cat)
const terriblePred = [0.01, 0.90, 0.09];
console.log('Terrible prediction CE:', crossEntropy(trueLabel, terriblePred).toFixed(4));
// 6.6439 bits — massive cost for confident wrong answer
// This is why cross-entropy penalizes confident wrong predictions
// so harshly — the log makes small probabilities very expensive
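// TensorFlow.js built-in (uses natural log, not log2).
// softmaxCrossEntropy expects logits, not probabilities, so pass the log of
// the predicted probabilities: softmax(log(q)) recovers q when q sums to 1.
const trueTensor = tf.tensor1d([1, 0, 0]);
const predTensor = tf.tensor1d([0.9, 0.05, 0.05]);
const ce = tf.losses.softmaxCrossEntropy(
  trueTensor.reshape([1, 3]),
  tf.log(predTensor).reshape([1, 3])
);
console.log('TF cross-entropy:', await ce.array());
// ~0.1054 nats, i.e. -ln(0.9), which is 0.1520 bits * ln(2)
KL Divergence: Distance Between Distributions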
KL divergence measures how different distribution q is from distribution p. It's cross-entropy minus entropy: the extra bits wasted by using the wrong distribution.
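The "cross-entropy minus entropy" statement follows directly from the definitions. Using base-2 logarithms to match the code in this lesson, and writing H(p, q) for cross-entropy and H(p) for entropy:

```latex
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, q)
  &= \sum_i p_i \log_2 \frac{p_i}{q_i} \\
  &= -\sum_i p_i \log_2 q_i \;-\; \Bigl(-\sum_i p_i \log_2 p_i\Bigr) \\
  &= H(p, q) - H(p)
\end{aligned}
```

Because KL divergence is never negative, cross-entropy can never drop below the entropy of the true distribution.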
function klDivergence(p: number[], q: number[]): number {
  return p.reduce((sum, pi, i) => {
    if (pi === 0) return sum;
    return sum + pi * Math.log2(pi / q[i]);
  }, 0);
}
const p = [0.7, 0.2, 0.1]; // true distribution
const q = [0.3, 0.4, 0.3]; // model's distribution
console.log('KL(p || q):', klDivergence(p, q).toFixed(4));
// 0.4972 bits: positive, because q is not a perfect model of p
console.log('KL(q || p):', klDivergence(q, p).toFixed(4));
// 0.5088 bits: a different value, so KL divergence is ASYMMETRIC
// KL(p||q) != KL(q||p)
// This asymmetry has practical consequences:
// - Minimizing KL(p||q) = minimizing cross-entropy (what we do in training)
// - Minimizing KL(q||p) = mode-seeking (used in variational inference)
// KL divergence of p from itself
console.log('KL(p || p):', klDivergence(p, p).toFixed(4));
// 0.0000 — zero when distributions match
// Why cross-entropy loss works for classification:
// Minimizing CE(true, predicted) is equivalent to minimizing
// KL(true || predicted), because the entropy of the true labels
// is constant. We're directly minimizing the information gap.
Why Cross-Entropy and Not MSE for Classification?
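One way to frame the question is to look at the gradient each loss sends back to the logit, since that is what actually drives a weight update. The sketch below assumes a single sigmoid output p = sigmoid(z) and a binary label y; the lesson does not specify an output layer, so this setup and the function names `mseGradWrtLogit` and `ceGradWrtLogit` are assumptions for illustration.

```typescript
// Gradients with respect to the logit z, where p = sigmoid(z)
// and dp/dz = p * (1 - p). Hypothetical helpers for illustration only.

// MSE loss (y - p)^2: chain rule gives -2 (y - p) * p (1 - p).
function mseGradWrtLogit(p: number, y: number): number {
  return -2 * (y - p) * p * (1 - p);
}

// Binary cross-entropy -[y ln p + (1 - y) ln(1 - p)]:
// the p (1 - p) factor cancels, leaving p - y.
function ceGradWrtLogit(p: number, y: number): number {
  return p - y;
}

// True label is 1; the predictions get progressively more wrong.
const y = 1;
for (const p of [0.5, 0.1, 0.01]) {
  console.log(
    `p=${p}`,
    'MSE:', mseGradWrtLogit(p, y).toFixed(4),
    'CE:', ceGradWrtLogit(p, y).toFixed(4)
  );
}
// p=0.5  MSE: -0.2500 CE: -0.5000
// p=0.1  MSE: -0.1620 CE: -0.9000
// p=0.01 MSE: -0.0196 CE: -0.9900
```

Under this assumption the MSE gradient shrinks toward zero as the prediction becomes more confidently wrong, because the sigmoid saturates, while the cross-entropy gradient approaches its largest magnitude.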
// Consider a confident wrong prediction: true=[1,0], pred=[0.01, 0.99]
const trueLabel = 1.0;
// MSE gradient at pred=0.01:
// d/dp (1 - p)^2 = -2(1-p) = -2(0.99) = -1.98
const mseGrad = -2 * (trueLabel - 0.01);
// Cross-entropy gradient at pred=0.01:
// d/dp -log(p) = -1/p = -1/0.01 = -100
const ceGrad = -1 / 0.01;
console.log('MSE gradient:', mseGrad.toFixed(2)); // -1.98
console.log('CE gradient:', ceGrad.toFixed(2)); // -100.00
// Cross-entropy produces a MUCH stronger gradient for confident
// wrong predictions. This is one reason CE usually trains faster for
// classification: it screams "you're wrong!" when the model is
// confidently incorrect, while MSE just whispers.
// Information-theoretic reason: CE directly minimizes the information
// gap (the KL divergence from the true labels). MSE corresponds to
// maximum likelihood under Gaussian noise, which is a poor model for
// categorical labels.
Challenge
Implement entropy, cross-entropy, and KL divergence from scratch.
Exercise
Compute Entropy
Implement three information-theoretic functions using natural logarithm (Math.log, base e): (1) `entropy` computes the Shannon entropy H(p) = -sum(p_i * ln(p_i)) for a probability distribution array. Treat 0 * ln(0) as 0. (2) `crossEntropy` computes H(p, q) = -sum(p_i * ln(q_i)) between a true distribution p and predicted distribution q. (3) `klDivergence` computes KL(p || q) = sum(p_i * ln(p_i / q_i)), the KL divergence from q to p. Skip terms where p_i is 0.
Key Takeaways
- ✓Entropy measures uncertainty — high entropy means unpredictable (hard to compress), low entropy means predictable
- ✓Cross-entropy measures the average cost of encoding with the wrong distribution, which makes it the standard loss function for classification
- ✓KL divergence is the extra cost (cross-entropy minus entropy) — it's asymmetric
- ✓Cross-entropy loss produces stronger gradients for confident wrong predictions than MSE
- ✓Minimizing cross-entropy is equivalent to minimizing KL divergence from the true distribution