// companion content · math depth

The Optimization Landscape

The loss surface is a high-dimensional terrain where SGD with momentum acts like a ball rolling downhill.

Instructor

In the Training Loop module, you adjusted learning rates and watched loss curves. But what's the geometry behind those curves? The loss function defines a surface in high-dimensional space, and training is a search for a low point on that surface. The shape of the surface strongly influences whether training converges, how fast, and to which solution.

Learning Objectives

  • Visualize loss functions as surfaces in weight space
  • Distinguish convex from non-convex optimization problems
  • Understand saddle points and why they're more common than local minima in high dimensions
  • Implement SGD with momentum using the ball-rolling-downhill analogy
  • Explain why learning rate schedules improve convergence

The Loss Surface

Imagine plotting loss as a function of two weights. You get a 3D surface with hills, valleys, and ridges. The goal of training is to find a low valley. In a real network with millions of weights, this surface exists in millions of dimensions. Much of the intuition from 3D carries over, with some exceptions such as the saddle points covered later in this lesson.
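To make the "surface" idea concrete, here is a minimal sketch that samples a two-weight loss on a regular grid, much like filling in a terrain heightmap. The helper name `sampleLossGrid`, the grid range, and the example bowl function are assumptions chosen for illustration, not part of any library.

```typescript
// Sample a two-weight loss on a regular grid, like building a terrain heightmap.
// sampleLossGrid and its parameters are hypothetical names for this sketch.
type Loss2D = (w1: number, w2: number) => number;

function sampleLossGrid(loss: Loss2D, min: number, max: number, n: number): number[][] {
  const grid: number[][] = [];
  for (let row = 0; row < n; row++) {
    // Each row holds one value of w2; each column holds one value of w1.
    const w2 = min + (row * (max - min)) / (n - 1);
    const heights: number[] = [];
    for (let col = 0; col < n; col++) {
      const w1 = min + (col * (max - min)) / (n - 1);
      heights.push(loss(w1, w2));
    }
    grid.push(heights);
  }
  return grid;
}

// Example surface (an assumption for this sketch): L(w1, w2) = w1^2 + 3 * w2^2
const bowl: Loss2D = (w1, w2) => w1 * w1 + 3 * w2 * w2;
const heightmap = sampleLossGrid(bowl, -2, 2, 5);

// Print the grid of heights; the smallest value sits at the center, (w1, w2) = (0, 0).
for (const row of heightmap) {
  console.log(row.map(h => h.toFixed(1)).join('\t'));
}
```

A grid like this is what a surface plot or contour plot would draw. It only works for two weights at a time, which is why real networks are usually visualized along a two-dimensional slice of weight space.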

Frontend

3D Game Terrain
player.velocity += gravity * dt

Machine Learning

SGD Momentum
velocity = momentum * velocity - lr * gradient
Structural Bridge
⚠ Where this breaks
3D game terrain is authored by a level designer. SGD navigates a million-dimensional non-convex loss landscape with no map; momentum helps escape some local minima but offers no guarantee of finding the global optimum.

In a game engine, a character walks on terrain defined by a heightmap. Gradient descent is the same idea: you're standing on the loss surface, you check which direction goes downhill most steeply (the negative gradient), and you take a step that way.

loss-surface.ts · typescript
// A simple 2D loss function: L(w1, w2) = w1^2 + 3*w2^2
// This is a bowl: convex, with one global minimum at (0, 0)
function convexLoss(w1: number, w2: number): number {
  return w1 * w1 + 3 * w2 * w2;
}

// Gradient: [dL/dw1, dL/dw2] = [2*w1, 6*w2]
function convexGradient(w1: number, w2: number): [number, number] {
  return [2 * w1, 6 * w2];
}

// Vanilla gradient descent: step against the gradient, scaled by the learning rate
let w1 = 5.0, w2 = 3.0;
const lr = 0.1;

for (let step = 0; step < 20; step++) {
  const [g1, g2] = convexGradient(w1, w2);
  w1 -= lr * g1;
  w2 -= lr * g2;
  console.log(`Step ${step}: w=[${w1.toFixed(3)}, ${w2.toFixed(3)}] loss=${convexLoss(w1, w2).toFixed(4)}`);
}
// Each step scales w1 by 0.8 and w2 by 0.4, so the weights shrink smoothly toward (0, 0)
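The loop above uses lr = 0.1, and that choice matters. On this bowl each step multiplies w1 by (1 - 2*lr) and w2 by (1 - 6*lr), so the run only converges while both factors stay below 1 in magnitude. The following sketch reruns the same descent with a few learning rates; the helper name `finalLoss` and the particular rates are assumptions picked to show the three regimes (too small, reasonable, too large).

```typescript
// Run plain gradient descent on L(w1, w2) = w1^2 + 3 * w2^2 and return the final loss.
// finalLoss is a hypothetical helper name used only for this sketch.
function finalLoss(lr: number, steps: number): number {
  let w1 = 5.0, w2 = 3.0;
  for (let step = 0; step < steps; step++) {
    w1 -= lr * 2 * w1; // dL/dw1 = 2 * w1
    w2 -= lr * 6 * w2; // dL/dw2 = 6 * w2
  }
  return w1 * w1 + 3 * w2 * w2;
}

// The starting loss is 5^2 + 3 * 3^2 = 52.
for (const lr of [0.001, 0.1, 0.4]) {
  console.log(`lr=${lr}: loss after 20 steps = ${finalLoss(lr, 20).toExponential(3)}`);
}
// lr=0.001 barely moves, lr=0.1 gets close to 0,
// and lr=0.4 makes |1 - 6*lr| = 1.4 > 1, so w2 grows every step and the loss blows up.
```

The steepest direction (here w2) sets the limit on how large the learning rate can be, which is one reason narrow valleys are hard for plain gradient descent.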

Convexity and Non-Convexity

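A convex function is bowl-shaped: the line segment between any two points on the surface lies on or above the surface. As a result, every local minimum is also a global minimum, and a strictly convex bowl has exactly one. Linear regression with squared-error loss is convex, so gradient descent with a suitably small learning rate converges to the best fit.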

Neural network loss functions are non-convex — they have multiple valleys, ridges, and saddle points. There's no guarantee you'll find the global minimum.
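The chord definition of convexity can be probed numerically. The sketch below samples pairs of points and checks whether the function at the midpoint ever rises above the average of the endpoint values; one such violation proves the function is not convex. The helper name `findsConvexityViolation`, the sampling range, and the two test functions are assumptions for illustration, and a result of "no violation found" is evidence, not proof, of convexity.

```typescript
type Fn1D = (w: number) => number;

// Returns true if some sampled chord dips below the curve, which proves f is NOT convex.
// A false result only means no violation was found among the sampled pairs.
function findsConvexityViolation(f: Fn1D, lo: number, hi: number, samples: number): boolean {
  const tol = 1e-9; // guards against floating-point noise
  for (let i = 0; i <= samples; i++) {
    for (let j = i + 1; j <= samples; j++) {
      const a = lo + ((hi - lo) * i) / samples;
      const b = lo + ((hi - lo) * j) / samples;
      const mid = 0.5 * (a + b);
      // Convexity requires f(midpoint) <= average of the endpoint values.
      if (f(mid) > 0.5 * (f(a) + f(b)) + tol) return true;
    }
  }
  return false;
}

const bowl1D: Fn1D = w => w * w;
const wavy: Fn1D = w => Math.sin(3 * w) + 0.5 * w * w - w;

console.log('bowl violates convexity:', findsConvexityViolation(bowl1D, -3, 3, 60));
console.log('wavy violates convexity:', findsConvexityViolation(wavy, -3, 3, 60));
```

The sine term in the second function bends the curve downward wherever its second derivative, 1 - 9*sin(3w), is negative, and that is where the chord test fails.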

non-convex.ts · typescript
// A non-convex 1D loss function with multiple minima
function nonConvexLoss(w: number): number {
  return Math.sin(3 * w) + 0.5 * w * w - w;
}

// Derivative of the loss above: dL/dw = 3*cos(3w) + w - 1
function nonConvexGradient(w: number): number {
  return 3 * Math.cos(3 * w) + w - 1;
}

// Starting from different points can lead to different minima
for (const start of [-2.0, 0.0, 2.0, 4.0]) {
  let w = start;
  const lr = 0.05;
  for (let i = 0; i < 100; i++) {
    w -= lr * nonConvexGradient(w);
  }
  console.log(`Start=${start.toFixed(1)} -> converged to w=${w.toFixed(4)}, loss=${nonConvexLoss(w).toFixed(4)}`);
}
// The answer depends on which valley the start falls into: this is non-convex optimization

Saddle Points

In high dimensions, local minima are comparatively rare. Saddle points are far more common: points where the gradient is zero but the surface curves up in some directions and down in others. Think of a mountain pass: it's the lowest point along the ridge line but the highest point on the path that crosses from one valley to the next.
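One rough way to see why minima become rare is a counting argument. At a point with zero gradient, the surface curves either up or down along each independent direction, and the point is a minimum only if it curves up along all of them. The sketch below assumes, unrealistically, that each direction is an independent fair coin flip. That assumption is a deliberate simplification, not a property of real networks, and `minimumProbability` is a hypothetical helper name.

```typescript
// Crude heuristic, NOT a property of real networks: treat the curvature sign along
// each of n directions at a zero-gradient point as an independent fair coin flip.
// The point is a minimum only if all n flips come up "curves up".
function minimumProbability(dimensions: number): number {
  return Math.pow(0.5, dimensions);
}

for (const n of [2, 10, 100]) {
  console.log(`n=${n}: P(all directions curve up) = ${minimumProbability(n).toExponential(2)}`);
}
```

Even under this toy model, the chance that every direction curves up shrinks exponentially with the number of weights, so almost all zero-gradient points end up with a mix of up and down directions.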

saddle-point.ts · typescript
// Saddle point example: f(x, y) = x^2 - y^2
// At (0, 0): gradient is [0, 0] but it's NOT a minimum
// It curves up in x, down in y: a saddle

function saddleLoss(x: number, y: number): number {
  return x * x - y * y;
}

function saddleGradient(x: number, y: number): [number, number] {
  return [2 * x, -2 * y];
}

// Plain gradient descent crawls near the saddle because the gradient there is tiny
let x = 0.001, y = 0.001;
const lr = 0.1;
for (let i = 0; i < 10; i++) {
  const [gx, gy] = saddleGradient(x, y);
  x -= lr * gx;
  y -= lr * gy;
  console.log(`Step ${i}: (${x.toFixed(6)}, ${y.toFixed(6)}) loss=${saddleLoss(x, y).toFixed(6)}`);
}
// x shrinks by a factor of 0.8 per step while y grows by a factor of 1.2, so the
// iterate drifts away along y, but slowly. Started exactly at (0, 0), it would never move.
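A saddle can also be recognized without running descent, by looking at the curvature directly. For a function of two variables, the second derivatives form a symmetric 2x2 matrix (the Hessian), and the signs of its eigenvalues say whether a zero-gradient point is a minimum, a maximum, or a saddle. This is a minimal sketch of that second-derivative test; `classifyCriticalPoint` and the `Kind` type are hypothetical names introduced here.

```typescript
// Classify a zero-gradient point of a 2D function from its symmetric Hessian
// [[a, b], [b, c]]. classifyCriticalPoint is a hypothetical helper for this sketch.
type Kind = 'minimum' | 'maximum' | 'saddle' | 'degenerate';

function classifyCriticalPoint(a: number, b: number, c: number): Kind {
  // Closed-form eigenvalues of a symmetric 2x2 matrix: mean -/+ radius.
  const mean = (a + c) / 2;
  const half = (a - c) / 2;
  const radius = Math.sqrt(half * half + b * b);
  const smallest = mean - radius;
  const largest = mean + radius;
  if (smallest > 0) return 'minimum'; // curves up in every direction
  if (largest < 0) return 'maximum';  // curves down in every direction
  if (smallest < 0 && largest > 0) return 'saddle'; // up in one direction, down in another
  return 'degenerate'; // a zero eigenvalue: the second-derivative test is inconclusive
}

// f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] everywhere.
console.log(classifyCriticalPoint(2, 0, -2)); // saddle
// f(x, y) = x^2 + 3*y^2 has Hessian [[2, 0], [0, 6]] everywhere.
console.log(classifyCriticalPoint(2, 0, 6)); // minimum
```

In a real network the Hessian has one row and column per weight, so it is rarely computed in full; the two-variable case is only meant to show what "curves up in some directions and down in others" means.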

SGD with Momentum

Momentum addresses two problems: it helps the optimizer move past saddle points, and it accelerates progress through flat regions and along narrow valleys. The physics analogy is a ball rolling downhill that accumulates velocity as it goes.

sgd-momentum.ts · typescript
// SGD with momentum: the ball-rolling-downhill optimizer
function sgdMomentum(
  lossGradFn: (w: number[]) => number[],
  initialWeights: number[],
  lr: number,
  momentum: number,
  steps: number
): number[] {
  const weights = [...initialWeights];
  const velocity: number[] = new Array(weights.length).fill(0);

  for (let step = 0; step < steps; step++) {
    const grads = lossGradFn(weights);

    for (let i = 0; i < weights.length; i++) {
      // Physics: v = momentum * v - lr * gradient
      velocity[i] = momentum * velocity[i] - lr * grads[i];
      // Physics: position += velocity
      weights[i] += velocity[i];
    }
  }
  return weights;
}

// Compare: plain SGD vs momentum on a narrow valley
// L(w1, w2) = 0.5 * w1^2 + 50 * w2^2
// This is like a long, narrow canyon, which is hard for plain SGD
const lossGrad = (w: number[]): number[] => [w[0], 100 * w[1]];

// momentum = 0 reduces the update to plain gradient descent
const plainResult = sgdMomentum(lossGrad, [10, 1], 0.005, 0, 200);
const momentumResult = sgdMomentum(lossGrad, [10, 1], 0.005, 0.9, 200);

console.log('Plain SGD:', plainResult.map(v => v.toFixed(4)));
console.log('Momentum:', momentumResult.map(v => v.toFixed(4)));
// Plain SGD shrinks w1 by only 0.5% per step, so w1 is still far from 0 after 200 steps;
// momentum brings both weights close to 0 in the same number of steps
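Why does momentum speed things up along the shallow direction? If the gradient stays roughly constant at some value g, the velocity update is a geometric series that settles at -lr * g / (1 - momentum). With momentum = 0.9, that is about ten times the plain step. The sketch below checks this on an idealized constant slope; the helper name `velocityOnConstantSlope` and the slope value are assumptions for illustration.

```typescript
// Apply the momentum update repeatedly under a constant gradient (a uniform slope)
// and return the velocity after the given number of steps.
// velocityOnConstantSlope is a hypothetical helper name for this sketch.
function velocityOnConstantSlope(
  lr: number,
  momentum: number,
  gradient: number,
  steps: number
): number {
  let velocity = 0;
  for (let step = 0; step < steps; step++) {
    velocity = momentum * velocity - lr * gradient;
  }
  return velocity;
}

const lrDemo = 0.005;
const slope = 1.0; // hypothetical constant gradient
const plainStep = velocityOnConstantSlope(lrDemo, 0, slope, 200);
const momentumStep = velocityOnConstantSlope(lrDemo, 0.9, slope, 200);
const predictedLimit = (-lrDemo * slope) / (1 - 0.9);

console.log(`plain step per iteration:    ${plainStep.toFixed(4)}`);
console.log(`momentum step per iteration: ${momentumStep.toFixed(4)}`);
console.log(`predicted limit -lr*g/(1-m): ${predictedLimit.toFixed(4)}`);
```

In the steep direction of a canyon the gradient flips sign from step to step, so successive velocity contributions partly cancel instead of adding up. That combination, a larger effective step along the valley floor and damping across it, is the usual explanation for why momentum handles narrow valleys better.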

Challenge

Build a visualization of gradient descent on a loss surface and implement momentum.

Exercise

Advanced · Arithmetic · ~20 min

Visualize Loss Surface

Implement SGD with momentum to minimize a non-convex loss function.

(1) `sgdStep` takes the current weight, gradient, learning rate, current velocity, and momentum coefficient, and returns an object { weight, velocity } after one momentum update. The update rules are:
newVelocity = momentum * velocity - lr * gradient
newWeight = weight + newVelocity

(2) `optimizeWithMomentum` runs SGD with momentum for a given number of steps on the loss function L(w) = sin(3w) + 0.5*w^2 - w (gradient: 3*cos(3w) + w - 1). Return an array of { weight, loss } objects for each step, including the initial state.


Key Takeaways

  • The loss function defines a surface in weight space — training navigates this terrain
  • Convex functions have one minimum (bowl-shaped); neural network losses are non-convex with many valleys
  • Saddle points (zero gradient but not a minimum) are more common than local minima in high dimensions
  • SGD with momentum accumulates velocity like a rolling ball, which helps it move past saddle points and accelerates it through flat regions
  • Learning rate controls step size; too large overshoots or diverges, too small makes progress very slow
