Gradient Descent

297. Gradient Descent

The gradient tells us the direction of steepest increase

For a function

𝑓 (𝑥_{1}, 𝑥_{2}, \dots, 𝑥_{𝑛})

the gradient of $𝑓$ , written $\nabla 𝑓$ , is:

\nabla 𝑓 (𝑥) = (\frac{𝜕 𝑓}{𝜕 𝑥_{1}}, \frac{𝜕 𝑓}{𝜕 𝑥_{2}}, \dots, \frac{𝜕 𝑓}{𝜕 𝑥_{𝑛}})

The magnitude of the gradient vector tells you how fast the function is increasing (steepness)
The direction of the gradient vector points to where the function increases most rapidly (direction)

Example

Function

𝑓 (𝑥_{1}, 𝑥_{2}) = 𝑥_{1}^{2} + 𝑥_{2}^{2}

Compute $\frac{𝜕 𝑓}{𝜕 𝑥_{1}}$

Differentiate $𝑓$ w.r.t. $𝑥_{1}$ (treat $𝑥_{2}$ as constant):

\frac{𝜕}{𝜕 𝑥_{1}} (𝑥_{1}^{2} + 𝑥_{2}^{2}) = \frac{𝜕}{𝜕 𝑥_{1}} (𝑥_{1}^{2}) + \frac{𝜕}{𝜕 𝑥_{1}} (𝑥_{2}^{2})

Since $𝑥_{2}^{2}$ is constant w.r.t. $𝑥_{1}$ , its derivative is 0:

\frac{𝜕}{𝜕 𝑥_{1}} (𝑥_{2}^{2}) = 0

The derivative of $𝑥_{1}^{2}$ w.r.t. $𝑥_{1}$ is:

\frac{𝜕}{𝜕 𝑥_{1}} (𝑥_{1}^{2}) = 2 𝑥_{1}

\frac{𝜕 𝑓}{𝜕 𝑥_{1}} = 2 𝑥_{1}

Compute $\frac{𝜕 𝑓}{𝜕 𝑥_{2}}$

Differentiate $𝑓$ w.r.t. $𝑥_{2}$ (treat $𝑥_{1}$ as constant):

\frac{𝜕}{𝜕 𝑥_{2}} (𝑥_{1}^{2} + 𝑥_{2}^{2}) = \frac{𝜕}{𝜕 𝑥_{2}} (𝑥_{1}^{2}) + \frac{𝜕}{𝜕 𝑥_{2}} (𝑥_{2}^{2})

Since $𝑥_{1}^{2}$ is constant w.r.t. $𝑥_{2}$ , its derivative is 0:

\frac{𝜕}{𝜕 𝑥_{2}} (𝑥_{1}^{2}) = 0

The derivative of $𝑥_{2}^{2}$ w.r.t. $𝑥_{2}$ is:

\frac{𝜕}{𝜕 𝑥_{2}} (𝑥_{2}^{2}) = 2 𝑥_{2}

So,

\frac{𝜕 𝑓}{𝜕 𝑥_{2}} = 2 𝑥_{2}

Combine to form gradient

\nabla 𝑓 (𝑥_{1}, 𝑥_{2}) = (2 𝑥_{1}, 2 𝑥_{2})

Evaluate at a point

𝑥 = (1, 2)

\nabla 𝑓 (1, 2) = (2 \times 1, 2 \times 2) = (2, 4)
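
The analytic gradient can be sanity-checked numerically. The sketch below is an illustration only: the helper name `numerical_gradient` and the step `h` are assumptions introduced here, and it uses central finite differences to approximate each partial derivative of $𝑓 (𝑥_{1}, 𝑥_{2}) = 𝑥_{1}^{2} + 𝑥_{2}^{2}$ at $(1, 2)$.

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + x2^2
    return x[0] ** 2 + x[1] ** 2

def numerical_gradient(func, x, h=1e-6):
    # Central differences: (f(x + h*e_i) - f(x - h*e_i)) / (2h) for each coordinate i
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

point = np.array([1.0, 2.0])
analytic = 2 * point                     # (2*x1, 2*x2) from the derivation
approx = numerical_gradient(f, point)    # should agree with (2, 4) up to rounding
print(analytic, approx)
```

Agreement between the two vectors is a quick way to catch sign or index mistakes in a hand-derived gradient.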

Interpretation

The gradient vector $(2, 4)$ at $(1, 2)$ points in the direction where the function $𝑓$ increases most rapidly
Its magnitude $\sqrt{2^{2} + 4^{2}} = \sqrt{20} \approx 4.47$ measures the steepness of that increase
Moving a small distance along the $𝑥_{1}$ axis alone increases the function at a rate equal to the partial derivative with respect to $𝑥_{1}$ , which is 2 per unit distance
Moving a small distance along the $𝑥_{2}$ axis alone increases the function at a rate equal to the partial derivative with respect to $𝑥_{2}$ , which is 4 per unit distance
Moving from $(1, 2)$ in the direction of $(2, 4)$ , the function value rises fastest, at an initial rate of about $4.47$ per unit distance moved
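
The rates in this interpretation are instantaneous rates of change, which a small numerical experiment can make concrete. The sketch below is a hedged illustration: the helper `rate_of_change` and the step `t` are assumptions introduced here. It estimates how fast $𝑓$ grows from $(1, 2)$ along each axis and along the unit gradient direction, then shows that a full unit step along $𝑥_{1}$ changes $𝑓$ by 3 rather than 2, because the slope itself grows along the way.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1] ** 2

x = np.array([1.0, 2.0])
grad = 2 * x                                # analytic gradient (2, 4)
unit_grad = grad / np.linalg.norm(grad)     # unit vector along the gradient

def rate_of_change(direction, t=1e-6):
    # Forward-difference estimate of the rate of change of f along a unit direction
    return (f(x + t * direction) - f(x)) / t

rate_x1 = rate_of_change(np.array([1.0, 0.0]))    # close to 2
rate_x2 = rate_of_change(np.array([0.0, 1.0]))    # close to 4
rate_grad = rate_of_change(unit_grad)             # close to sqrt(20)
print(rate_x1, rate_x2, rate_grad)

# A full unit step along x1: f(2, 2) - f(1, 2) = 8 - 5
full_step_change = f(x + np.array([1.0, 0.0])) - f(x)
print(full_step_change)
```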

297.0.1. Algorithm

Find the minimum of a function

In gradient descent, we move in the opposite direction of the gradient to minimize a function

Direction of maximum increase:

\nabla 𝑓 (𝑥)

Direction of maximum decrease:

- \nabla 𝑓 (𝑥)

Initialize: Start with an initial guess for the parameters
Compute Gradient: Find the gradient of the function at the current parameters
Update Parameters: Adjust the parameters by moving in the opposite direction of the gradient, scaled by the learning rate
Repeat: Continue the process until the parameters converge to a minimum or the changes are minimal

Update Rule:

𝑥^{𝑘 + 1} \leftarrow 𝑥^{𝑘} - 𝛼_{𝑘} \nabla 𝑓 (𝑥^{𝑘})

Where:

$𝑥^{𝑘}$ : current iterate (point in domain of $𝑓$ )
$𝛼_{𝑘}$ : step size or learning rate at iteration $𝑘$
$\nabla 𝑓 (𝑥^{𝑘})$ : gradient of the function $𝑓$ with respect to $𝑥$ , evaluated at $𝑥^{𝑘}$
$𝑥^{𝑘 + 1}$ : next iterate (point in domain of $𝑓$ )

gradient_descent.py

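import numpy as np

def gradient_descent(
    f,              # Function to minimize, f(x); kept for reference, not used by the update
    grad_f,         # Gradient of f, grad_f(x)
    x0,             # Initial guess for x (scalar or array-like)
    alpha=0.01,     # Learning rate (fixed step size)
    max_iter=1000,  # Maximum number of iterations
    tol=1e-6        # Stop when the gradient norm falls below this tolerance
):
    # Copy into a float array so the caller's x0 is never modified in place
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        grad = grad_f(x)
        if np.linalg.norm(grad) < tol:
            break
        x = x - alpha * grad
    return x

def f(x):
    """Example: minimize f(x) = x^2 + 2x + 1"""
    return x**2 + 2*x + 1

def grad_f(x):
    """Gradient of f(x) = x^2 + 2x + 1 is f'(x) = 2x + 2"""
    return 2*x + 2

x_initial = 3.0  # Starting point (an illustrative choice)

x_min = gradient_descent(
    f,
    grad_f,
    x_initial,
    alpha=0.1,
    max_iter=100
)
print(x_min)  # Approaches the minimizer x = -1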

Example

Function

𝑓 (𝑥_{1}, 𝑥_{2}) = 𝑥_{1}^{2} + 𝑥_{2}^{2}

Gradient

\nabla 𝑓 (𝑥) = (\frac{𝜕 𝑓}{𝜕 𝑥_{1}}, \frac{𝜕 𝑓}{𝜕 𝑥_{2}}) = (2 𝑥_{1}, 2 𝑥_{2})

Learning Rate

𝛼 = 0.1

Iteration 1

Initial Parameter Values:

𝑥_{1} = 1, \quad 𝑥_{2} = 2

Gradient Calculation

\nabla 𝑓 (𝑥) = (2 \times 1, 2 \times 2) = (2, 4)

Parameter Update:

𝑥 \leftarrow 𝑥 - 𝛼 \nabla 𝑓 (𝑥)

\begin{matrix} 𝑥_{1} \leftarrow 1 - 0.1 \times 2 = 0.8 \\ 𝑥_{2} \leftarrow 2 - 0.1 \times 4 = 1.6 \end{matrix}

Updated Parameter Values:

𝑥_{1} = 0.8, \quad 𝑥_{2} = 1.6

Iteration 2

Current Parameter Values

𝑥_{1} = 0.8, \quad 𝑥_{2} = 1.6

Gradient Calculation:

\nabla 𝑓 (𝑥) = (2 \times 0.8, 2 \times 1.6) = (1.6, 3.2)

Parameter Update:

𝑥 \leftarrow 𝑥 - 𝛼 \nabla 𝑓 (𝑥)

\begin{matrix} 𝑥_{1} \leftarrow 0.8 - 0.1 \times 1.6 = 0.64 \\ 𝑥_{2} \leftarrow 1.6 - 0.1 \times 3.2 = 1.28 \end{matrix}

Updated Parameter Values:

𝑥_{1} = 0.64, \quad 𝑥_{2} = 1.28

Iteration 3

Current Values

𝑥_{1} = 0.64, \quad 𝑥_{2} = 1.28

Gradient Calculation:

\nabla 𝑓 (𝑥) = (2 \times 0.64, 2 \times 1.28) = (1.28, 2.56)

Parameter Update:

𝑥 \leftarrow 𝑥 - 𝛼 \nabla 𝑓 (𝑥)

\begin{matrix} 𝑥_{1} \leftarrow 0.64 - 0.1 \times 1.28 = 0.512 \\ 𝑥_{2} \leftarrow 1.28 - 0.1 \times 2.56 = 1.024 \end{matrix}

Updated Parameter Values:

𝑥_{1} = 0.512, \quad 𝑥_{2} = 1.024
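
The three hand-computed iterations can be reproduced with a short loop. This is a minimal sketch, assuming the same starting point $(1, 2)$ and learning rate $0.1$ ; the list name `history` is introduced here only for illustration.

```python
import numpy as np

alpha = 0.1                       # learning rate from the example
x = np.array([1.0, 2.0])          # initial parameter values
history = [x.copy()]

for k in range(3):
    grad = 2 * x                  # gradient of x1^2 + x2^2 is (2*x1, 2*x2)
    x = x - alpha * grad          # update rule: x <- x - alpha * grad
    history.append(x.copy())
    print(f"iteration {k + 1}: x = {x}")
```

For this function each update multiplies both coordinates by $1 - 2 𝛼 = 0.8$ , so the iterates shrink geometrically toward the minimizer $(0, 0)$ .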

Example

Problem Setup

Minimize the quadratic function:

\min_{𝑥 \in ℝ^{2}} 𝑓 (𝑥) = 4 𝑥_{1}^{2} - 4 𝑥_{1} 𝑥_{2} + 2 𝑥_{2}^{2}, \quad \text{where } 𝑥 = [\begin{matrix} 𝑥_{1} \\ 𝑥_{2} \end{matrix}]

Gradient of $𝑓 (𝑥)$

The gradient is:

\nabla 𝑓 (𝑥) = [\begin{matrix} \frac{𝜕 𝑓}{𝜕 𝑥_{1}} \\ \frac{𝜕 𝑓}{𝜕 𝑥_{2}} \end{matrix}]

Step 1: Compute $\frac{𝜕 𝑓}{𝜕 𝑥_{1}}$

Take the derivative of each term with respect to $𝑥_{1}$ (treat $𝑥_{2}$ as a constant):

$\frac{𝜕}{𝜕 𝑥_{1}} (4 𝑥_{1}^{2}) = 8 𝑥_{1}$
$\frac{𝜕}{𝜕 𝑥_{1}} (- 4 𝑥_{1} 𝑥_{2}) = - 4 𝑥_{2}$
$\frac{𝜕}{𝜕 𝑥_{1}} (2 𝑥_{2}^{2}) = 0$

So:

\frac{𝜕 𝑓}{𝜕 𝑥_{1}} = 8 𝑥_{1} - 4 𝑥_{2}

Step 2: Compute $\frac{𝜕 𝑓}{𝜕 𝑥_{2}}$

Take the derivative of each term with respect to $𝑥_{2}$ (treat $𝑥_{1}$ as a constant):

$\frac{𝜕}{𝜕 𝑥_{2}} (4 𝑥_{1}^{2}) = 0$
$\frac{𝜕}{𝜕 𝑥_{2}} (- 4 𝑥_{1} 𝑥_{2}) = - 4 𝑥_{1}$
$\frac{𝜕}{𝜕 𝑥_{2}} (2 𝑥_{2}^{2}) = 4 𝑥_{2}$

So:

\frac{𝜕 𝑓}{𝜕 𝑥_{2}} = - 4 𝑥_{1} + 4 𝑥_{2}

So the gradient of $𝑓 (𝑥)$ is:

\nabla 𝑓 (𝑥) = [\begin{matrix} 8 𝑥_{1} - 4 𝑥_{2} \\ - 4 𝑥_{1} + 4 𝑥_{2} \end{matrix}]

Optimal Solution

\begin{matrix} 𝑥^{*} = [\begin{matrix} 0 \\ 0 \end{matrix}] \\ 𝑓 (𝑥^{*}) = 0 \end{matrix}
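
One way to see why the origin is the unique minimizer is to write the function in matrix form. The sketch below assumes the representation $𝑓 (𝑥) = \frac{1}{2} 𝑥^{T} 𝐴 𝑥$ with the symmetric matrix $𝐴$ shown in the code (a name introduced here, not in the original notes), so that the gradient is $𝐴 𝑥$ . It checks the matrix form against the explicit polynomial, confirms that both eigenvalues of $𝐴$ are positive, and solves $𝐴 𝑥 = 0$ for the stationary point.

```python
import numpy as np

# Assumed matrix form: f(x) = 0.5 * x^T A x, so grad f(x) = A x
A = np.array([[8.0, -4.0],
              [-4.0, 4.0]])

def f_poly(x):
    # Explicit form from the problem setup
    return 4 * x[0] ** 2 - 4 * x[0] * x[1] + 2 * x[1] ** 2

def f_matrix(x):
    return 0.5 * x @ A @ x

test_point = np.array([2.0, 3.0])               # arbitrary point for the comparison
forms_agree = np.isclose(f_poly(test_point), f_matrix(test_point))

eigenvalues = np.linalg.eigvalsh(A)             # both positive => strictly convex
x_star = np.linalg.solve(A, np.zeros(2))        # stationary point: A x = 0
print(forms_agree, eigenvalues, x_star, f_poly(x_star))
```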

Gradient Descent Iterations

Iteration 1

Initial guess:

𝑥^{0} = [\begin{matrix} 2 \\ 3 \end{matrix}] \Rightarrow 𝑓 (𝑥^{0}) = 4 (2^{2}) - 4 (2) (3) + 2 (3^{2}) = 10

Gradient at $𝑥^{0}$ :

\nabla 𝑓 (𝑥^{0}) = [\begin{matrix} 8 (2) - 4 (3) \\ - 4 (2) + 4 (3) \end{matrix}] = [\begin{matrix} 4 \\ 4 \end{matrix}]

Line Search:

\begin{aligned} 𝑥 (𝛼_{0}) & = 𝑥^{0} - 𝛼_{0} \nabla 𝑓 (𝑥^{0}) \\ & = [\begin{matrix} 2 \\ 3 \end{matrix}] - 𝛼_{0} [\begin{matrix} 4 \\ 4 \end{matrix}] \\ & = [\begin{matrix} 2 - 4 𝛼_{0} \\ 3 - 4 𝛼_{0} \end{matrix}] \end{aligned}

\begin{aligned} 𝑓 (𝑥 (𝛼_{0})) & = 4 𝑥_{1}^{2} - 4 𝑥_{1} 𝑥_{2} + 2 𝑥_{2}^{2} \\ & = 4 (2 - 4 𝛼_{0})^{2} - 4 (2 - 4 𝛼_{0}) (3 - 4 𝛼_{0}) + 2 (3 - 4 𝛼_{0})^{2} \\ & = 32 𝛼_{0}^{2} - 32 𝛼_{0} + 10 \end{aligned}

Minimizing:

𝛼_{0} = \arg\min_{𝛼} 𝑓 (𝑥 (𝛼))

\begin{aligned} \frac{𝜕}{𝜕 𝛼_{0}} 𝑓 (𝑥 (𝛼_{0})) & = \frac{𝜕}{𝜕 𝛼_{0}} (32 𝛼_{0}^{2} - 32 𝛼_{0} + 10) \\ & = 64 𝛼_{0} - 32 = 0 \end{aligned}

𝛼_{0} = \frac{1}{2}
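
For a quadratic the line search can also be done in closed form. The sketch below assumes the matrix form $𝑓 (𝑥) = \frac{1}{2} 𝑥^{T} 𝐴 𝑥$ (the matrix `A` and the coefficient names `a`, `b`, `c` are introduced here for illustration). Expanding $𝑓 (𝑥^{0} - 𝛼 𝑔)$ with $𝑔 = \nabla 𝑓 (𝑥^{0})$ gives a parabola in $𝛼$ whose coefficients should match $32 𝛼^{2} - 32 𝛼 + 10$ , and whose vertex is the exact step size.

```python
import numpy as np

A = np.array([[8.0, -4.0],
              [-4.0, 4.0]])        # assumed matrix form: f(x) = 0.5 * x^T A x
x0 = np.array([2.0, 3.0])
g0 = A @ x0                        # gradient at x0

# f(x0 - alpha*g0) = a*alpha^2 + b*alpha + c
a = 0.5 * (g0 @ A @ g0)            # coefficient of alpha^2
b = -(g0 @ g0)                     # coefficient of alpha
c = 0.5 * (x0 @ A @ x0)            # constant term, equal to f(x0)

alpha0 = -b / (2 * a)              # vertex of the parabola: (g.g) / (g.A.g)
print(a, b, c, alpha0)
```

The same formula, $𝛼 = \frac{𝑔^{T} 𝑔}{𝑔^{T} 𝐴 𝑔}$ , applies at every iteration of this example, which avoids expanding the polynomial by hand each time.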

Update Step:

𝑥^{1} = 𝑥^{0} - 𝛼_{0} \nabla 𝑓 (𝑥^{0}) = [\begin{matrix} 2 \\ 3 \end{matrix}] - \frac{1}{2} [\begin{matrix} 4 \\ 4 \end{matrix}] = [\begin{matrix} 0 \\ 1 \end{matrix}]

Check Progress:

New point:

𝑥^{1} = [\begin{matrix} 0 \\ 1 \end{matrix}]

Function value decreases $𝑓 (𝑥^{𝑘 + 1}) < 𝑓 (𝑥^{𝑘})$

𝑓 (𝑥^{1}) = 2 < 𝑓 (𝑥^{0}) = 10

Gradient at new point:

\nabla 𝑓 (𝑥^{1}) = [\begin{matrix} 8 (0) - 4 (1) \\ - 4 (0) + 4 (1) \end{matrix}] = [\begin{matrix} - 4 \\ 4 \end{matrix}]

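Gradient magnitude (here $‖ \nabla 𝑓 (𝑥^{𝑘 + 1}) ‖ = ‖ \nabla 𝑓 (𝑥^{𝑘}) ‖$ , so it does not decrease in this step)

‖ \nabla 𝑓 (𝑥^{1}) ‖ = 4 \sqrt{2} = ‖ \nabla 𝑓 (𝑥^{0}) ‖

Note: The gradient magnitude is unchanged in this iteration. Exact line search guarantees that the function value decreases, not that the gradient norm does; for this quadratic and this starting point the two gradients happen to have the same length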
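
What exact line search does constrain is the direction of the next gradient: at the minimizing step size the derivative along the search line is zero, so the new gradient is perpendicular to the previous one. The short check below is a sketch assuming the matrix form $\nabla 𝑓 (𝑥) = 𝐴 𝑥$ for this quadratic (the matrix `A` is a name introduced here); it confirms the perpendicularity, the equal norms, and the drop in function value.

```python
import numpy as np

A = np.array([[8.0, -4.0],
              [-4.0, 4.0]])        # gradient of f is A @ x for this quadratic

x0 = np.array([2.0, 3.0])
g0 = A @ x0                        # gradient at x0
x1 = x0 - 0.5 * g0                 # step with alpha_0 = 1/2 from the line search
g1 = A @ x1                        # gradient at x1

dot = g0 @ g1                      # zero when the gradients are perpendicular
norm0, norm1 = np.linalg.norm(g0), np.linalg.norm(g1)
f0, f1 = 0.5 * (x0 @ A @ x0), 0.5 * (x1 @ A @ x1)
print(dot, norm0, norm1, f0, f1)
```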

Iteration 2

New Point:

𝑥^{1} = [\begin{matrix} 0 \\ 1 \end{matrix}]

Gradient at $𝑥^{1}$ :

\nabla 𝑓 (𝑥^{1}) = [\begin{matrix} 8 (0) - 4 (1) \\ - 4 (0) + 4 (1) \end{matrix}] = [\begin{matrix} - 4 \\ 4 \end{matrix}]

𝛼_{1} = \arg\min_{𝛼} 𝑓 (𝑥^{1} - 𝛼 \nabla 𝑓 (𝑥^{1}))

Line search:

\begin{aligned} 𝑥 (𝛼_{1}) & = 𝑥^{1} - 𝛼_{1} \nabla 𝑓 (𝑥^{1}) \\ & = [\begin{matrix} 0 \\ 1 \end{matrix}] - 𝛼_{1} [\begin{matrix} - 4 \\ 4 \end{matrix}] \\ & = [\begin{matrix} 4 𝛼_{1} \\ 1 - 4 𝛼_{1} \end{matrix}] \end{aligned}

\begin{aligned} 𝑓 (𝑥 (𝛼_{1})) & = 4 𝑥_{1}^{2} - 4 𝑥_{1} 𝑥_{2} + 2 𝑥_{2}^{2} \\ & = 4 (4 𝛼_{1})^{2} - 4 (4 𝛼_{1}) (1 - 4 𝛼_{1}) + 2 (1 - 4 𝛼_{1})^{2} \\ & = 160 𝛼_{1}^{2} - 32 𝛼_{1} + 2 \end{aligned}

Minimize:

𝛼_{1} = \arg\min_{𝛼_{1}} 𝑓 (𝑥 (𝛼_{1})) = \frac{1}{10}

\begin{aligned} \frac{𝜕}{𝜕 𝛼_{1}} 𝑓 (𝑥 (𝛼_{1})) & = \frac{𝜕}{𝜕 𝛼_{1}} (160 𝛼_{1}^{2} - 32 𝛼_{1} + 2) \\ & = 320 𝛼_{1} - 32 = 0 \end{aligned}

𝛼_{1} = \frac{1}{10}

Update Step:

𝑥^{2} = 𝑥^{1} - 𝛼_{1} \nabla 𝑓 (𝑥^{1}) = [\begin{matrix} 0 \\ 1 \end{matrix}] - \frac{1}{10} [\begin{matrix} - 4 \\ 4 \end{matrix}] = [\begin{matrix} 0.4 \\ 0.6 \end{matrix}]

Improvement:

New point:

𝑥^{2} = [\begin{matrix} 0.4 \\ 0.6 \end{matrix}]

Function value decreases $𝑓 (𝑥^{𝑘 + 1}) < 𝑓 (𝑥^{𝑘})$

𝑓 (𝑥^{2}) = 0.4 < 𝑓 (𝑥^{1}) = 2

Gradient at new point:

\nabla 𝑓 (𝑥^{2}) = [\begin{matrix} 8 (0.4) - 4 (0.6) \\ - 4 (0.4) + 4 (0.6) \end{matrix}] = [\begin{matrix} 0.8 \\ 0.8 \end{matrix}]

Gradient magnitude $‖ \nabla 𝑓 (𝑥^{𝑘 + 1}) ‖ < ‖ \nabla 𝑓 (𝑥^{𝑘}) ‖$

‖ \nabla 𝑓 (𝑥^{2}) ‖ = ‖ (0.8, 0.8) ‖ = \frac{4 \sqrt{2}}{5} < ‖ \nabla 𝑓 (𝑥^{1}) ‖ = 4 \sqrt{2}
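
The two iterations above can be automated. The following is a minimal sketch of gradient descent with exact line search for this particular quadratic, assuming the matrix form $𝑓 (𝑥) = \frac{1}{2} 𝑥^{T} 𝐴 𝑥$ and the closed-form step $𝛼 = \frac{𝑔^{T} 𝑔}{𝑔^{T} 𝐴 𝑔}$ ; the function name `steepest_descent_exact` and its default tolerances are assumptions introduced here. It is not a general-purpose line search.

```python
import numpy as np

A = np.array([[8.0, -4.0],
              [-4.0, 4.0]])        # assumed matrix form: f(x) = 0.5 * x^T A x

def f(x):
    return 0.5 * (x @ A @ x)

def steepest_descent_exact(x0, tol=1e-8, max_iter=100):
    # Gradient descent where each step size minimizes f along the negative gradient
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(max_iter):
        g = A @ x                              # gradient at the current iterate
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ A @ g)          # exact line search for a quadratic
        x = x - alpha * g
        iterates.append(x.copy())
    return iterates

iterates = steepest_descent_exact([2.0, 3.0])
for k, xk in enumerate(iterates[:3]):
    print(k, xk, f(xk))
```

The first three entries of `iterates` correspond to $𝑥^{0}$ , $𝑥^{1}$ and $𝑥^{2}$ from the worked iterations, and later entries continue toward the minimizer at the origin.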

Example

Problem Setup

Minimize the quadratic function:

\min_{𝑥 \in ℝ^{2}} 𝑓 (𝑥) = 𝑥_{1}^{2} - 2 𝑥_{1} 𝑥_{2} + 2 𝑥_{2}^{2} + 2 𝑥_{1}, \quad \text{where } 𝑥 = [\begin{matrix} 𝑥_{1} \\ 𝑥_{2} \end{matrix}]

Gradient of $𝑓 (𝑥)$

\nabla 𝑓 (𝑥_{1}, 𝑥_{2}) = [\begin{matrix} 2 𝑥_{1} - 2 𝑥_{2} + 2 \\ - 2 𝑥_{1} + 4 𝑥_{2} \end{matrix}]
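
As a quick check of this gradient, the sketch below assumes the matrix form $𝑓 (𝑥) = \frac{1}{2} 𝑥^{T} 𝐴 𝑥 + 𝑏^{T} 𝑥$ , so that $\nabla 𝑓 (𝑥) = 𝐴 𝑥 + 𝑏$ (the names `A`, `b` and `test_point` are introduced here for illustration). It compares the two gradient expressions at one point and solves $𝐴 𝑥 = - 𝑏$ for the stationary point that gradient descent should approach.

```python
import numpy as np

# Assumed matrix form: f(x) = 0.5 * x^T A x + b^T x
A = np.array([[2.0, -2.0],
              [-2.0, 4.0]])
b = np.array([2.0, 0.0])

def f(x):
    return x[0] ** 2 - 2 * x[0] * x[1] + 2 * x[1] ** 2 + 2 * x[0]

def grad_f(x):
    # Gradient as stated in the text
    return np.array([2 * x[0] - 2 * x[1] + 2, -2 * x[0] + 4 * x[1]])

test_point = np.array([1.0, -1.0])     # arbitrary point chosen for the check
matrix_form_ok = np.allclose(A @ test_point + b, grad_f(test_point))

x_star = np.linalg.solve(A, -b)        # stationary point where the gradient vanishes
print(matrix_form_ok, x_star, f(x_star), grad_f(x_star))
```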