The Optimization Landscape
The loss surface is a high-dimensional terrain where SGD with momentum acts like a ball rolling downhill.
In the Training Loop module, you adjusted learning rates and watched loss curves. But what's the geometry behind those curves? The loss function defines a surface in high-dimensional space, and training is the process of finding the lowest point. The shape of that surface determines everything — whether training converges, how fast, and to what solution.
Learning Objectives
- Visualize loss functions as surfaces in weight space
- Distinguish convex from non-convex optimization problems
- Understand saddle points and why they're more common than local minima in high dimensions
- Implement SGD with momentum using the ball-rolling-downhill analogy
- Explain why learning rate schedules improve convergence
The Loss Surface
Imagine plotting loss as a function of two weights. You get a 3D surface — hills, valleys, and ridges. The goal of training is to find the lowest valley. In a real network with millions of weights, this surface exists in millions of dimensions, but the intuition from 3D holds.
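To make the two-weight picture concrete, here is a minimal sketch that samples a loss on a regular grid, producing the kind of heightmap a plotting library could render. The helper name `sampleLossSurface`, the grid range, and the resolution are assumptions chosen for illustration; the loss is the bowl L(w1, w2) = w1^2 + 3*w2^2 used in this section's first example.

```typescript
// Hypothetical helper: sample a two-weight loss on a regular grid.
// Each row holds the loss values for one fixed w2 and varying w1.
type LossFn = (w1: number, w2: number) => number;

function sampleLossSurface(
  loss: LossFn,
  min: number,
  max: number,
  resolution: number
): number[][] {
  const step = (max - min) / (resolution - 1);
  const grid: number[][] = [];
  for (let row = 0; row < resolution; row++) {
    const w2 = min + row * step;
    const heights: number[] = [];
    for (let col = 0; col < resolution; col++) {
      const w1 = min + col * step;
      heights.push(loss(w1, w2));
    }
    grid.push(heights);
  }
  return grid;
}

// Illustrative surface: a simple bowl with its minimum at (0, 0).
const bowl: LossFn = (w1, w2) => w1 * w1 + 3 * w2 * w2;
const heightmap = sampleLossSurface(bowl, -2, 2, 5);

// Scan the grid for the lowest sampled cell.
let lowest = { row: 0, col: 0, value: Infinity };
heightmap.forEach((heights, row) =>
  heights.forEach((value, col) => {
    if (value < lowest.value) lowest = { row, col, value };
  })
);

console.log(heightmap.map(r => r.map(v => v.toFixed(1)).join(' ')).join('\n'));
console.log('Lowest sampled point:', lowest);
```

A grid scan like this only works for two or three weights. With millions of weights the number of grid cells is astronomically large, which is why training follows the local slope instead of surveying the whole surface.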
Frontend: 3D Game Terrain
player.velocity += gravity * dt
Machine Learning: SGD Momentum
velocity = momentum * velocity - lr * gradient
In a game engine, a character walks on terrain defined by a heightmap. Gradient descent is the same idea: you're standing on the loss surface, you look for the direction that goes downhill (the negative gradient), and you take a step that way.
// A simple 2D loss function: L(w1, w2) = w1^2 + 3*w2^2
// This is a bowl: convex, with one global minimum at (0, 0)
function convexLoss(w1: number, w2: number): number {
  return w1 * w1 + 3 * w2 * w2;
}

// Gradient: [dL/dw1, dL/dw2] = [2*w1, 6*w2]
function convexGradient(w1: number, w2: number): [number, number] {
  return [2 * w1, 6 * w2];
}

// Vanilla gradient descent
let w1 = 5.0, w2 = 3.0;
const lr = 0.1;
for (let step = 0; step < 20; step++) {
  const [g1, g2] = convexGradient(w1, w2);
  w1 -= lr * g1;
  w2 -= lr * g2;
  console.log(`Step ${step}: w=[${w1.toFixed(3)}, ${w2.toFixed(3)}] loss=${convexLoss(w1, w2).toFixed(4)}`);
}
// Converges smoothly toward (0, 0)
Convexity and Non-Convexity
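A convex function is bowl-shaped: the line segment between any two points on the surface lies on or above the surface. As a result, every local minimum of a convex function is also a global minimum. Linear regression with squared-error loss is convex, so gradient descent with a suitably small learning rate converges to the best answer.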
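The chord condition can be tested numerically. The sketch below checks midpoint convexity, f((a + b) / 2) <= (f(a) + f(b)) / 2, on a grid of sample points. The helper name `findConvexityViolation`, the sampling range, and the two test functions are assumptions for illustration. Finding a violation proves a function is non-convex; finding none only means no counterexample turned up among the sampled pairs.

```typescript
// Hypothetical helper: search a grid of point pairs for a violation of
// midpoint convexity. Returns the offending pair, or null if none is found.
function findConvexityViolation(
  f: (w: number) => number,
  min: number,
  max: number,
  samples: number
): [number, number] | null {
  const tolerance = 1e-12; // absorbs floating-point rounding
  const step = (max - min) / (samples - 1);
  for (let i = 0; i < samples; i++) {
    for (let j = i + 1; j < samples; j++) {
      const a = min + i * step;
      const b = min + j * step;
      const chordMidpoint = (f(a) + f(b)) / 2;
      // Convexity requires the function at the midpoint to sit on or below the chord.
      if (f((a + b) / 2) > chordMidpoint + tolerance) return [a, b];
    }
  }
  return null;
}

// A parabola (convex) and the same parabola with a sine wave added (not convex).
const parabola = (w: number): number => 0.5 * w * w - w;
const wavy = (w: number): number => Math.sin(3 * w) + 0.5 * w * w - w;

console.log('parabola violation:', findConvexityViolation(parabola, -3, 3, 25));
console.log('wavy violation:', findConvexityViolation(wavy, -3, 3, 25));
```

The parabola passes every sampled pair, while the wavy function fails as soon as a chord spans one of the bumps added by the sine term.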
Neural network loss functions are non-convex — they have multiple valleys, ridges, and saddle points. There's no guarantee you'll find the global minimum.
// A non-convex 1D loss function with multiple minima
function nonConvexLoss(w: number): number {
  return Math.sin(3 * w) + 0.5 * w * w - w;
}

function nonConvexGradient(w: number): number {
  return 3 * Math.cos(3 * w) + w - 1;
}

// Starting from different points leads to different minima
for (const start of [-2.0, 0.0, 2.0, 4.0]) {
  let w = start;
  const lr = 0.05;
  for (let i = 0; i < 100; i++) {
    w -= lr * nonConvexGradient(w);
  }
  console.log(`Start=${start.toFixed(1)} -> converged to w=${w.toFixed(4)}, loss=${nonConvexLoss(w).toFixed(4)}`);
}
// Different starting points, different answers: this is non-convex optimization
Saddle Points
In high dimensions, local minima are comparatively rare. Saddle points are far more common: points where the gradient is zero but the surface curves up in some directions and down in others. Think of a mountain pass: it's the lowest point along the ridge but the highest point on the path that crosses from one valley to the other.
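One way to make "curves up in some directions and down in others" precise is to look at the eigenvalues of the Hessian (the matrix of second derivatives) at a zero-gradient point. The sketch below does this for the two-variable case, where the eigenvalues of a symmetric 2x2 matrix have a closed form. The function name `classifyCriticalPoint` and the `eps` threshold are assumptions for illustration.

```typescript
// Classify a zero-gradient point of a two-variable function from its
// symmetric 2x2 Hessian [[a, b], [b, c]] using the signs of its eigenvalues.
type CriticalPoint = 'minimum' | 'maximum' | 'saddle' | 'degenerate';

function classifyCriticalPoint(a: number, b: number, c: number): CriticalPoint {
  // Closed-form eigenvalues of a symmetric 2x2 matrix.
  const mean = (a + c) / 2;
  const radius = Math.sqrt(((a - c) / 2) ** 2 + b * b);
  const lambdaMin = mean - radius;
  const lambdaMax = mean + radius;

  const eps = 1e-12; // treat near-zero curvature as inconclusive
  if (Math.abs(lambdaMin) < eps || Math.abs(lambdaMax) < eps) return 'degenerate';
  if (lambdaMin > 0) return 'minimum'; // curves up in every direction
  if (lambdaMax < 0) return 'maximum'; // curves down in every direction
  return 'saddle';                     // mixed signs: up in one direction, down in another
}

// f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] everywhere.
console.log(classifyCriticalPoint(2, 0, -2)); // saddle
// f(x, y) = x^2 + 3*y^2 has Hessian [[2, 0], [0, 6]] everywhere.
console.log(classifyCriticalPoint(2, 0, 6)); // minimum
```

This also hints at why minima become rare as dimensions grow. As a rough heuristic only (an assumption, not a statement about any real network), if each of n curvature signs were an independent coin flip, the chance that all n are positive would be 2^-n.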
// Saddle point example: f(x, y) = x^2 - y^2
// At (0, 0): gradient is [0, 0] but it's NOT a minimum
// It curves up in x, down in y: a saddle
function saddleLoss(x: number, y: number): number {
  return x * x - y * y;
}

function saddleGradient(x: number, y: number): [number, number] {
  return [2 * x, -2 * y];
}

// Plain gradient descent lingers near the saddle when it starts close to it
let x = 0.001, y = 0.001;
const lr = 0.1;
for (let i = 0; i < 10; i++) {
  const [gx, gy] = saddleGradient(x, y);
  x -= lr * gx;
  y -= lr * gy;
  console.log(`Step ${i}: (${x.toFixed(6)}, ${y.toFixed(6)}) loss=${saddleLoss(x, y).toFixed(6)}`);
}
// x decays toward 0 while y grows by only 20% per step, so the escape is slow
// because the gradient is tiny near the saddle
SGD with Momentum
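Momentum helps with two problems: it speeds up the escape from saddle points and accelerates progress through flat regions. The physics analogy is a ball rolling downhill that accumulates velocity, with the momentum coefficient controlling how much of the previous velocity is kept at each step.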
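The saddle-escape claim can be checked on the saddle f(x, y) = x^2 - y^2 from the previous section. The sketch below counts how many updates it takes for |y| to exceed 1 when starting at (0.001, 0.001). The helper name `stepsToEscape` and the threshold of 1 are arbitrary assumptions for illustration. Because this toy function has no minimum, "escaping" here simply means moving away from the saddle.

```typescript
// Count the updates needed to leave the neighborhood of the saddle of
// f(x, y) = x^2 - y^2, using the momentum update
//   v = momentum * v - lr * gradient;  position += v
// Setting momentum to 0 gives plain gradient descent.
function stepsToEscape(momentum: number, lr: number, maxSteps: number): number {
  let x = 0.001, y = 0.001;
  let vx = 0, vy = 0;
  for (let step = 1; step <= maxSteps; step++) {
    const gx = 2 * x;  // df/dx
    const gy = -2 * y; // df/dy
    vx = momentum * vx - lr * gx;
    vy = momentum * vy - lr * gy;
    x += vx;
    y += vy;
    if (Math.abs(y) > 1) return step; // arbitrary "escaped" threshold
  }
  return maxSteps; // did not escape within the budget
}

const plainSteps = stepsToEscape(0, 0.1, 1000);
const momentumSteps = stepsToEscape(0.9, 0.1, 1000);
console.log(`Plain gradient descent escaped after ${plainSteps} steps`);
console.log(`Momentum (0.9) escaped after ${momentumSteps} steps`);
```

With momentum, each small push away from the saddle is added to the velocity left over from earlier steps, so the distance from the saddle grows faster per step than it does with plain gradient descent.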
// SGD with momentum: the ball-rolling-downhill optimizer
function sgdMomentum(
  lossGradFn: (w: number[]) => number[],
  initialWeights: number[],
  lr: number,
  momentum: number,
  steps: number
) {
  const weights = [...initialWeights];
  const velocity = new Array(weights.length).fill(0);
  for (let step = 0; step < steps; step++) {
    const grads = lossGradFn(weights);
    for (let i = 0; i < weights.length; i++) {
      // Physics: v = momentum * v - lr * gradient
      velocity[i] = momentum * velocity[i] - lr * grads[i];
      // Physics: position += velocity
      weights[i] += velocity[i];
    }
  }
  return weights;
}

// Compare: plain SGD vs momentum on a narrow valley
// L(w1, w2) = 0.5 * w1^2 + 50 * w2^2
// This is like a long, narrow canyon, which is hard for plain SGD
const lossGrad = (w: number[]): number[] => [w[0], 100 * w[1]];
const plainResult = sgdMomentum(lossGrad, [10, 1], 0.005, 0, 200);
const momentumResult = sgdMomentum(lossGrad, [10, 1], 0.005, 0.9, 200);
console.log('Plain SGD:', plainResult.map(v => v.toFixed(4)));
console.log('Momentum:', momentumResult.map(v => v.toFixed(4)));
// Momentum converges much faster in the narrow valley
Challenge
Build a visualization of gradient descent on a loss surface and implement momentum.
Exercise
Visualize Loss Surface
Implement SGD with momentum to minimize a non-convex loss function. (1) `sgdStep` takes the current weight, gradient, learning rate, current velocity, and momentum coefficient, and returns an object { weight, velocity } after one momentum update. The update rules are: newVelocity = momentum * velocity - lr * gradient, newWeight = weight + newVelocity. (2) `optimizeWithMomentum` runs SGD with momentum for a given number of steps on the loss function L(w) = sin(3w) + 0.5*w^2 - w (gradient: 3*cos(3w) + w - 1). Return an array of { weight, loss } objects for each step including the initial state.
Key Takeaways
- The loss function defines a surface in weight space, and training navigates this terrain
- Convex functions have one minimum (bowl-shaped); neural network losses are non-convex with many valleys
- Saddle points (zero gradient but not a minimum) are more common than local minima in high dimensions
- SGD with momentum accumulates velocity like a rolling ball: it escapes saddle points faster and accelerates through flat regions
- Learning rate controls step size; too large overshoots or diverges, too small makes progress painfully slow
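The last takeaway can be seen on the convex bowl L(w1, w2) = w1^2 + 3*w2^2 from the start of this section. The sketch below runs 20 steps of plain gradient descent from (5, 3) at three learning rates. The helper name `finalLoss` and the specific rates are assumptions chosen for illustration.

```typescript
// Run plain gradient descent on L(w1, w2) = w1^2 + 3*w2^2 from (5, 3)
// and report the loss after a fixed number of steps.
function finalLoss(learningRate: number, steps: number): number {
  let a = 5.0, b = 3.0;
  for (let i = 0; i < steps; i++) {
    a -= learningRate * 2 * a; // dL/dw1 = 2*w1
    b -= learningRate * 6 * b; // dL/dw2 = 6*w2
  }
  return a * a + 3 * b * b;
}

const initialLoss = 5 * 5 + 3 * 3 * 3; // loss at the starting point (5, 3)
console.log(`initial loss: ${initialLoss}`);
for (const rate of [0.001, 0.1, 0.4]) {
  console.log(`lr=${rate}: loss after 20 steps = ${finalLoss(rate, 20).toExponential(3)}`);
}
```

Each step multiplies w2 by (1 - 6 * lr). For this particular bowl, that factor has magnitude greater than 1 once lr exceeds 1/3, so w2 grows instead of shrinking and the loss blows up. A very small rate keeps the factor just below 1, so the loss falls, but only slightly in 20 steps.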