Gradient descent

Gradient descent is a commonly used optimization algorithm in machine learning (as well as in other fields). Generally speaking, gradient descent is an algorithm that minimizes functions: given a function and an initial input value, gradient descent iteratively moves towards an input value that minimizes the function, i.e. ideally the algorithm terminates at a value where the function in question attains a minimum.

Intuitively, the gradient descent algorithm works by always following the steepest descending "slope" until a "level area" - or a minimum - is reached*. * Unfortunately the obtained minimum can be just one of many minima, i.e. there is no guarantee that we end up in the global minimum when greedily following the steepest descent. Later we'll look into methods that can be used to mitigate this. To gain a full appreciation of gradient descent optimization we need to dive into the gory mathematical details of the gradient. In the following, concepts are introduced in their general, multi-dimensional form but exemplified in 2 and 3 dimensions, which hopefully gives a better intuition of what is going on.

Understanding the gradient vector

The formal definition of the gradient of a function \(f : \mathbb{R}^{n} \to \mathbb{R}\) at point \(x \in \mathbb{R}^{n}\), denoted \(\nabla f(x)\), is:

$$\begin{equation} \label{eq:gradient_definition} \nabla f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix} \end{equation}$$

I.e. the gradient is an \(n\)-dimensional vector where the \(i\)-th element is the partial derivative† of \(f(x)\) with respect to \(x_i\), † A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. where \(x = (x_1, \ldots, x_n)\). Note that \(f\) must be differentiable at \(x\) for the gradient to be well-defined there.

An intuitive interpretation of the gradient vector at a specific point \(x\) is that it points in the direction of greatest increase of the function \(f\), as each element of the gradient tells you how fast \(f(x)\) is changing along the corresponding standard basis direction. One might then wonder whether \(f\) could be changing faster in some other direction, but it can indeed be shown that the gradient vector \(\nabla f(x)\) points in the direction of steepest ascent‡. ‡ Proof can be found here. Of course, this also means that the opposite direction, \(−\nabla f(x)\), is the direction of steepest descent.

In summary, the gradient is a vector whose elements are the slope of the function along each of the coordinate axes, i.e. it's a multidimensional generalisation of the usual derivative of a one-dimensional function of a real variable.
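Before moving on, the steepest-ascent property can also be checked numerically. The sketch below compares the directional derivative of a function over many unit directions against the length of its gradient; the particular function and point are arbitrary choices for illustration, not something from the discussion above:

```python
import numpy as np

# Numerical check of the steepest-ascent property: among all unit
# directions u, the directional derivative of f is largest along
# grad f(x) / ||grad f(x)||, and its value there equals ||grad f(x)||.
# The function and point are arbitrary choices for illustration.
def f(v):
    x, y = v
    return np.sin(x) + x * y**2

def grad_f(v):
    x, y = v
    return np.array([np.cos(x) + y**2, 2 * x * y])

point = np.array([0.5, -1.0])
h = 1e-6

slopes = []
for theta in np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])      # unit direction
    slopes.append((f(point + h * u) - f(point)) / h)  # directional derivative

print(max(slopes))                    # ~2.1273
print(np.linalg.norm(grad_f(point)))  # ~2.1273 (the same value)
```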

The gradient vector explained by an example

Before continuing to the gradient descent algorithm let's have a look at the gradient vector for a function of two variables, and see how the gradient relates to its 3-dimensional surface plot. As an example we'll use an elliptic paraboloid function with domain \(\mathbb{R}^{2}\):

$$\begin{equation} \label{eq:example_function} f(x,y) = 6x^{2} + 4y^{2} - 4xy \end{equation}$$

where the partial derivatives with respect to \(x\) and \(y\) are

$$\begin{equation} \label{eq:example_function_derivative_x} \frac{\partial f(x,y)}{\partial x} = 12x - 4y \notag \end{equation}$$

and

$$\begin{equation} \label{eq:example_function_derivative_y} \frac{\partial f(x,y)}{\partial y} = 8y - 4x \notag \end{equation}$$

which, according to \eqref{eq:gradient_definition}, gives us the gradient vector

$$\begin{equation} \label{eq:example_function_gradient} \nabla f(x,y) = \begin{bmatrix} 12x - 4y \\ 8y - 4x \end{bmatrix} \end{equation}$$
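As a quick sanity check, here's a minimal Python sketch with the example function and its gradient transcribed from the equations above, compared against central finite differences (the probe point and step size \(h\) are arbitrary choices):

```python
import numpy as np

# The example function f(x, y) = 6x^2 + 4y^2 - 4xy and its gradient,
# transcribed from the equations above and checked against central
# differences. The probe point and step size h are arbitrary choices.
def f(v):
    x, y = v
    return 6 * x**2 + 4 * y**2 - 4 * x * y

def grad_f(v):
    x, y = v
    return np.array([12 * x - 4 * y, 8 * y - 4 * x])

v = np.array([-20.0, 0.0])
h = 1e-5
numeric = np.array([
    (f(v + [h, 0]) - f(v - [h, 0])) / (2 * h),
    (f(v + [0, h]) - f(v - [0, h])) / (2 * h),
])
print(grad_f(v))  # [-240.   80.]
print(numeric)    # the same, up to floating-point error
```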

Figure 1 below shows the graph of our example function, \eqref{eq:example_function}, plotted along with its gradient vector field given by \eqref{eq:example_function_gradient}.

Figure 1: Example function and its gradient vector field

The length and direction of the gradient vectors correspond to the slope of the function at the point of origin of each vector. It can be seen that the vectors die off around the "bottom" of the surface plot, culminating in a zero gradient at the (global) minimum located at \((x, y) = (0, 0)\).
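Indeed, setting the gradient \eqref{eq:example_function_gradient} to zero confirms that this is the only stationary point: the second component gives \(x = 2y\), and substituting into the first yields \(12 \cdot 2y - 4y = 20y = 0\), hence

$$\begin{equation} \nabla f(x,y) = 0 \iff (x, y) = (0, 0) \notag \end{equation}$$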

Projected onto the \((x, y)\) plane you also see the contours of the function, where the contour lines are isolines of constant function value (i.e. all points on a given contour line have equal function value). The function contours can be compared to the contour lines known from cartography - on a map these lines join points of equal elevation (height) above a given level. This means that if you walk along a contour line the slope will be zero (you'll be following the side of the valley), while if you walk perpendicular to the contour lines the slope may be quite steep (as you're heading towards either the valley floor or the top). And the closer the contour lines are together, the steeper the slope. This analogy fits our function plot very well: the length of the gradient vectors increases as the function contours grow closer together because of the steeper slope, and the direction of the gradient vectors is orthogonal to the function contours, pointing up the slope.

The gradient descent algorithm

From the above discussion it should be clear that, starting from some arbitrary location, you can find a (potentially local) minimum of a differentiable function by descending in the direction opposite to the gradient, i.e. in the direction given by \(−\nabla f(x)\). Apart from knowing which direction to search next, we also need a step size determining how far we go in that particular direction. This step size needs to be chosen carefully - if it's too large we might actually "bounce out" of our minimum, as we'll see in an example below. If, on the other hand, our step size is too small we might descend at an excruciatingly slow pace.

More formally, this method can be described iteratively: starting at some arbitrary point \(x^{(0)}\), at every iteration \(k \geq 0\) we move from point \(x^{(k)}\) in the direction given by \(\Delta x^{(k)}\) with a step size of \(t_k\) to reach the next point \(x^{(k+1)} = x^{(k)} + t_k \cdot \Delta x^{(k)}\). As we are doing gradient descent, the direction we move in is given by the negative gradient at the point, i.e. \(\Delta x^{(k)} = −\nabla f(x^{(k)})\).

In summary, the iterative search of gradient descent can be described via the following recursive rule:

$$\begin{equation} \label{eq:gradient_descent_recursive_rule} x^{(k+1)} = x^{(k)} - t_k \cdot \nabla f(x^{(k)}) \end{equation}$$
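The recursive rule translates directly into code. Here's a minimal sketch of a single update, using the gradient of our example function (the fixed step size is an illustrative assumption):

```python
import numpy as np

# One gradient descent update: x^(k+1) = x^(k) - t_k * grad f(x^(k)).
# A fixed step size is used here purely for illustration.
def grad_f(v):
    x, y = v
    return np.array([12 * x - 4 * y, 8 * y - 4 * x])

def gradient_descent_step(v, t):
    return v - t * grad_f(v)

v = np.array([-20.0, 0.0])
print(gradient_descent_step(v, 0.02))  # [-15.2  -1.6]
```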

Given that our objective is to minimize the function \(f\), one reasonable approach is to choose our step size \(t_k\) — when at a certain point \(x^{(k)}\) — in a manner that will minimize the value of \(f\) at the new point \(x^{(k+1)}\), i.e. choose the step size that minimizes \(f(x^{(k+1)})\). Combining this with \eqref{eq:gradient_descent_recursive_rule} we can define the optimal step size, \(t^{optimal}_k\), formally as:

Where \(argmin\) is the argument of the minimum. As a simple example, \(argmin_x (f(x))\) is the value of \(x\) for which \(f(x)\) attains its minimum. $$\begin{equation} \label{eq:gradient_descent_step_size} t^{optimal}_k = argmin_{t \geq 0} f(x^{(k)} - t \cdot \nabla f(x^{(k)})) \end{equation}$$
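Since \eqref{eq:gradient_descent_step_size} is just a one-dimensional minimization over \(t\), it can be approximated numerically. Below is a simple grid-search sketch; the grid bounds and resolution are arbitrary illustrative assumptions, and in practice more refined line-search methods would be used:

```python
import numpy as np

# Approximate the optimal step size by minimizing
# phi(t) = f(x - t * grad f(x)) over a grid of candidate values of t.
# The grid bounds and resolution are arbitrary illustrative assumptions.
def f(v):
    x, y = v
    return 6 * x**2 + 4 * y**2 - 4 * x * y

def grad_f(v):
    x, y = v
    return np.array([12 * x - 4 * y, 8 * y - 4 * x])

def optimal_step(v, t_max=1.0, resolution=10_000):
    ts = np.linspace(0.0, t_max, resolution)
    g = grad_f(v)
    values = [f(v - t * g) for t in ts]
    return ts[int(np.argmin(values))]

print(optimal_step(np.array([-20.0, 0.0])))  # ~0.0714 for this point
```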

For now we'll assume that we are able to find an appropriate step size (not necessarily the optimal one, as that can be computationally expensive). Below we'll look into a few examples to see the effect of varying the step size.

The algorithm

When a function is at a minimum its gradient will vanish, i.e. if we keep repeating the recursive rule \eqref{eq:gradient_descent_recursive_rule} until the gradient becomes sufficiently small we'll know that we are close to a minimum.

Formally we define the gradient descent algorithm as described below, where the "sufficiently small" from the previous paragraph is formalized as the desired precision \(\epsilon > 0.\) As mentioned above, we'll assume that we are able to find a feasible step size \(t_k\).

The size of the gradient is given by the vector norm \(\lVert \nabla f(x) \rVert\) - for now it suffices to think of the norm as the Euclidean length of the gradient vector.
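In code, the algorithm can be sketched as follows (a minimal Python sketch; the fixed step size, starting point, and iteration cap are simplifying assumptions, since we assume a feasible \(t_k\) can be found):

```python
import numpy as np

# Gradient descent: repeat x <- x - t * grad f(x) until ||grad f(x)|| < eps.
# A fixed step size t is used for simplicity (an illustrative assumption;
# choosing t more carefully is discussed in the text).
def gradient_descent(grad_f, x0, t=0.02, eps=1e-6, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:  # gradient is "sufficiently small"
            break
        x = x - t * g
    return x

grad_f = lambda v: np.array([12 * v[0] - 4 * v[1], 8 * v[1] - 4 * v[0]])
print(gradient_descent(grad_f, [-20.0, 0.0]))  # converges towards (0, 0)
```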

Gradient descent illustrated by an example

Let's illustrate the gradient descent algorithm by applying it to our example function \eqref{eq:example_function} (defined as \(f(x,y) = 6x^{2} + 4y^{2} - 4xy\)). We'll start out at \((x, y) = (-20, 0)\), compute the gradient as specified by \eqref{eq:example_function_gradient} and apply the gradient descent algorithm using a fixed step size of \(0.02\). The results of the first few iterations of the algorithm are listed in the table below.

| \(k\) | \(\frac{\partial f(x,y)}{\partial x}\) | \(\frac{\partial f(x,y)}{\partial y}\) | \(x\) | \(y\) | \(f(x,y)\) |
|---|---|---|---|---|---|
| 0 | N/A | N/A | -20 | 0 | 2400 |
| 1 | -240 \(=12\cdot(-20)-4\cdot0\) | 80 \(=8\cdot0-4\cdot(-20)\) | -15.2 \(=-20-0.02\cdot(-240)\) | -1.6 \(=0-0.02\cdot80\) | 1299.2 |
| 2 | -176 \(=12\cdot(-15.2)-4\cdot(-1.6)\) | 48 \(=8\cdot(-1.6)-4\cdot(-15.2)\) | -11.68 \(=-15.2-0.02\cdot(-176)\) | -2.56 \(=-1.6-0.02\cdot48\) | 725.146 |
| 3 | -129.92 | 26.24 | -9.0816 | -3.0848 | 420.857 |
| 4 | -96.64 | 11.648 | -7.1488 | -3.31776 | 255.79 |
| 5 | -72.51456 | 2.05312 | -5.69851 | -3.35882 | 163.404 |
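These iterations are straightforward to reproduce in code (a sketch reusing the example function and its gradient from above):

```python
import numpy as np

# Reproduce the table above: gradient descent on the example function
# from (-20, 0) with a fixed step size of 0.02.
def f(v):
    x, y = v
    return 6 * x**2 + 4 * y**2 - 4 * x * y

def grad_f(v):
    x, y = v
    return np.array([12 * x - 4 * y, 8 * y - 4 * x])

v = np.array([-20.0, 0.0])
print(0, v, f(v))                   # k = 0: no gradient applied yet
for k in range(1, 6):
    g = grad_f(v)                   # gradient at the previous point
    v = v - 0.02 * g
    print(k, g, v, round(f(v), 3))  # matches the corresponding table row
```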

The numbers from the table are illustrated visually in figure 2, where the red dots in the x,y-plane correspond to the x,y-values calculated by the gradient descent algorithm above, while the red "balls" rolling down the surface towards the minimum of the function represent the values of the function \(f\) for these inputs.

Figure 2: Example of gradient descent

If we continue applying the gradient descent algorithm to our example function, the resulting x,y-pairs will gravitate towards \((0,0)\) where the minimum of the function is located. This should be fairly evident from the figure above but might be better illustrated by figure 3, which does away with the surface of the function and focuses on its contour map only. The function and gradient descent step size are the same for both figures.

Step size matters

As mentioned above, it's not always possible to apply the optimal step size, as it might be computationally expensive to calculate. But we have to be careful when choosing step sizes - if chosen too small, our progress might grind to a halt as the gradient tapers off on a plateau, while step sizes that are too large might hurl us right past the minimum we are searching for. This is illustrated in the figures below, which all show the contour map for our example function \eqref{eq:example_function} along with the path traversed by gradient descent at various step sizes.
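The three behaviours can also be mimicked numerically. Reading the step sizes from the text below (0.02 for the small step, five times that for the "efficient" one, and 50% more again for the too-large one), a short sketch shows slow convergence, fast convergence, and divergence respectively (the iteration count of 50 is an arbitrary choice):

```python
import numpy as np

# Compare fixed step sizes: 0.02 (small), 0.1 (5x larger), 0.15 (50% more).
# 0.15 exceeds the stable range for this quadratic, so the iterates diverge.
def grad_f(v):
    x, y = v
    return np.array([12 * x - 4 * y, 8 * y - 4 * x])

for t in (0.02, 0.1, 0.15):
    v = np.array([-20.0, 0.0])
    for _ in range(50):
        v = v - t * grad_f(v)
    print(f"t = {t}: x = {v}, ||grad|| = {np.linalg.norm(grad_f(v)):.3g}")
```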

The first example illustrates gradient descent with a small step size. This descent follows the gradient of our test surface very closely, but also moves very slowly when we get close to the minimum (if we were allowed only to repeat our descent algorithm for a set number of steps, one could easily imagine the algorithm finishing before getting close to the actual minimum).

Figure 3: Gradient descent using a small step size

The next example shows the path traversed by gradient descent using a step size 5 times larger than the step size used above. As we can see this results in a much faster - but also more erratic - descent. Still, as the descent gets "trapped by" the minimum of the function, the steps get smaller as the gradient tends towards \(0\) around the minimum.

Figure 4: Gradient descent with an "efficient" step size

The final example illustrates gradient descent using a step size 50% larger than in the previous example. The erratic behaviour has spun completely out of control, hurling the "descent" back and forth, overshooting the minimum while moving ever farther away from it.

Figure 5: Gradient descent using too large a step size

Getting stuck at a local minimum

When working with gradient descent optimization you have to be aware that there is no guarantee that the algorithm finds the global minimum - if your function has several minima, gradient descent may well lead you to a local one instead.

Figure 6: Function with multiple minima

If you consider the function with multiple minima illustrated above, you'll see that gradient descent does not necessarily bring you to the global minimum (located approximately at \((0,-2)\)). Depending on where you start out, you might instead end up in the local minimum on the left (approximately at \((-2,0)\)). Or, if using a small step size, the algorithm might leave you in the shallow minimum located around \((0,0)\).
