An intuitive and visual guide to the mathematics behind gradient descent

Without any calculus, you will be able to understand the actual mathematics behind the concept of gradient descent.

Luis Garcia Fuentes
10 min read · Mar 13, 2022

Reason for this article

I have two issues with most posts that I’ve encountered on the topic of gradient descent. These are:

  • articles focus too much on the algorithms that minimize the cost function (which in my opinion, are more of an explanation over algorithms available to achieve minimization, not an explanation of the gradient as a concept)
  • or define the gradient through mathematical expressions, which while most accurate, limits the ease with which you can understand the concept.

Hence, this blog is my attempt to provide a more conceptual understanding of the gradient.

Let’s put aside the concept of a cost function or algorithms to descend the gradient, and look at what the gradient is in a more abstract way.

We will at the end bring those concepts back with a few words, but for now, let's try to understand visually what the gradient is.

Let's not waste any time and see what needs to be seen: a 3D visualization that explains the gradient, by Eugene Khutoryansky.

[Embedded video: 3D visualization of the gradient, by Eugene Khutoryansky]

The above video ties the mathematical definition of the gradient to a 3D visual, however, you may need some help breaking the concepts down. The rest of this blog does this, and by the end, you should be able to re-watch the above video and have everything click.

Back to the basics

You are probably familiar with functions of the f(x) = y type. This type of function takes one input (x) and yields one output (y). A popular function to use as an example when discussing gradient descent is a convex function, the “U”-shaped one.
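
To make this concrete, here is a minimal sketch in Python, assuming the classic convex example f(x) = x² (my choice for illustration; any “U”-shaped function works):

```python
# A simple convex ("U"-shaped) function: one input x, one output y
def f(x):
    return x ** 2

print(f(-2), f(0), f(3))  # 4 0 9 -- smallest output at the bottom of the "U", x = 0
```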

Often, at this point, explanations will show how you can jump down the convex curve and present this action as the definition of gradient descent (sometimes gravity is somehow even brought into the explanation 🙅‍♂️). While that is an okay-ish idea, it leaves a lot of insight on the table.

So what is the gradient? Let's start with a formal-ish definition and break it down with the convex curve as our base scenario.

The gradient is, at any point in the input space, the direction that maximizes the distance traveled in the output space per unit of movement in the input space.

We will now break down the above definition into pieces, defining what is meant by “space”, “distance traveled”, and so on.

For those who already know calculus:

The following sections describe the derivative of a function. “The distance traveled in the output space, per unit of movement in the input space” is the rate of change of our function, also known as the derivative.

Defining “Space”

Let’s digest the statement together. First, the statement refers to two types of spaces: the input space and the output space. So, what exactly are these spaces?

The input space is all possible values of our input x. Ergo, the x-axis is our input space, as the x-axis captures all possible values that x can take, and for now (again, assuming a 2D scenario), x is our only input variable. The output space, meanwhile, is all possible values of our output y, or f(x).

We now understand what the definition refers to when it talks about space. The input space is all possible values of our input x, and the output space is all possible values of our output variable y.

Now, what does “the distance traveled in the output space per unit of movement in the input space” even mean?

While it sounds complicated, this is one of those occasions where describing things with a bit of math makes them clearer.

Defining “Distance traveled”

Let us define “distance traveled” in the input space as the change between two instances of x.

We write “change” with the symbol δ. It helps if you read δ as “change in” or “distance traveled in”. Thus, distance traveled (or δ) for each variable can be defined as:

δx = x₂ − x₁ and δy = y₂ − y₁

This makes sense, as we are going from one point of x to another point of x, and thus, traveling a distance between two points in our input space x.

Similarly, let us define “distance traveled” in the output space as a change between two instances of y.

Don't forget that every x has an associated value of y, as y is what we obtain after we transform x through the function f(x). Therefore, the moment we define the values that x changes from and to, we obtain by default the change in y.

From the figures below, we can observe how changing x by δx results in a distance traveled in y of δy.

We provide two images to help you see how where δx sits determines where δy lands. Look for δy as the highlighted section on the y-axis.

Note that we purposely let δx be the same in magnitude (size) across both visuals, taking the value of 1 unit. Also note that despite δx being of the same magnitude, the δy created by each δx is different in magnitude. This will become important later.
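
A quick numerical check of this point, again assuming f(x) = x²: we travel the same δx = 1 starting from two different places and compare the resulting δy.

```python
def f(x):
    return x ** 2

dx = 1.0  # the same distance traveled in the input space, used twice

for x_start in (0.5, 2.0):  # two different starting points
    dy = f(x_start + dx) - f(x_start)  # resulting distance traveled in the output space
    print(f"starting at x = {x_start}: dy = {dy}")

# starting at x = 0.5: dy = 2.0
# starting at x = 2.0: dy = 5.0  -- same dx, yet a larger dy
```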

Defining “Distance traveled δy per unit of δx”

Knowing that any δx is associated with a specific δy helps us make sense of the following part of our definition: “the distance traveled in the output space per unit of movement in the input space”. Any distance traveled in the output space (δy) is created by a distance traveled in the input space (δx).

In order to better talk about the size of δy for a given δx, let’s simply capture this relationship as the ratio δy/δx. Thus, we can express “the distance traveled in the output space per unit of movement in the input space” as simply δy/δx.

Note that δy of different lengths are created by the same δx; this is driven by the fact that how much δy you get per δx is not constant across the function.

Without worrying about the actual values of δy/δx that we have created through our examples, we can hopefully agree that because one δy is larger than the other, and both δx are of equal length, it follows that one δy/δx is greater than the other.

“Distance traveled δy per unit of δx” can vary

Let's create a color code to visually capture the difference between possible values of δy/δx, denoting greater values of δy/δx in red.

Now, we've only calculated two instances of δy/δx, so let's color-code our visual to capture the values of δy/δx in it.

Let's first color-code δy/δx in the output space (δy), and then map these into the input space (δx). By map, we mean coloring each δx with its associated δy/δx.

We are color-coding the δy/δx in output space and then in input space.

“Distance traveled δy per unit of δx” = Slope

Another way to look at δy/δx is in terms of the slope of a function. This perspective will be helpful to understand the gradient. The slope is defined as the rise over the run. In our example, δy is our rise, while δx is our run, and therefore δy/δx is our slope. The slope is thus equivalent to the change in the output space per unit of movement in the input space.

For our current scenario, we take a mental shortcut and simply draw a tangent line to the two sections of our function governed by the two instances of δx. We can see that one slope is greater (steeper) than the other, exactly as we would expect from our logical exercise so far.

Even better, instead of working with arbitrary units of δx, we can make δx small enough that any curve in a graph will locally approximate a linear function (ignoring stochastic functions, which are beyond the scope of this blog).

Note how in the visual below, if we zoom in and make δx very small, the function becomes linear. By taking advantage of this trick, we can calculate δy/δx across the entire function.

If we zoom in enough, that is, if δx is made very small, the function becomes almost linear, and our slope definition is quite accurate.
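
We can watch this trick at work numerically. Still assuming f(x) = x², the ratio δy/δx at x = 1 settles toward the true tangent slope (which happens to be 2) as we shrink δx:

```python
def f(x):
    return x ** 2

x = 1.0
for dx in (1.0, 0.1, 0.01, 0.001):
    slope = (f(x + dx) - f(x)) / dx  # dy / dx with an ever smaller dx
    print(f"dx = {dx}: dy/dx = {slope:.4f}")

# dx = 1.0:   dy/dx = 3.0000
# dx = 0.1:   dy/dx = 2.1000
# dx = 0.01:  dy/dx = 2.0100
# dx = 0.001: dy/dx = 2.0010  -- converging to the tangent slope of 2
```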

Using the trick we just saw, we now know that δy/δx will visually be represented as tangent lines to each point in the graph, as we are just finding the slope at all points.

The figure below on the left displays many (but not all) of the slopes/tangents that make up our function. Given that this is messy to read, let's map the newly found slope values into our x-input space as we did before.

The figure below on the right color-codes the input space according to the magnitude of the slope that each now very small instance of δx creates.

We find that δy/δx increases as δx occurs at higher absolute magnitudes of x. This should not surprise us, as we can see that the U-shaped function becomes steeper the further from x = 0 we get.
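
A small sweep makes this explicit: using a tiny δx everywhere, the slope grows with the absolute magnitude of x (same assumed f(x) = x²):

```python
def f(x):
    return x ** 2

dx = 1e-6  # small enough that dy/dx approximates the tangent slope
for x in (-3, -1, 0, 1, 3):
    slope = (f(x + dx) - f(x)) / dx
    print(f"x = {x:+d}: slope ≈ {slope:.2f}")

# slopes of roughly -6, -2, 0, +2, +6 -- steeper the further we move from x = 0
```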

Putting what we have learned so far together

In case you are confused, let's break down what we have done so far. We were presented with this definition of the gradient: at any point in the input space, the direction that maximizes the distance traveled in the output space per unit of movement in the input space.

We discovered that the input space referred simply to our x-axis, while the output space referred to all possible values of our output, in our case living on the y-axis.

We then saw that each change in x caused a change in y, and that the relationship between the changes of these two variables was not necessarily constant. By that, I am referring to the idea that δy/δx, as a ratio, was not the same across the function.

We drove this point home by visually displaying δy/δx as a slope, which is possible to do if we make δx very small and zoom in to each point in the graph. We also captured the “magnitude” of the δy/δx or slope in the input space, in our case, color-coding the x-axis.

Right now, the concept of the “direction that maximizes” movement in the output space per movement in the input space is not very approachable, as we are dealing with a 2D scenario, and in this scenario, at any point, we can only move in the “left” and “right” directions.

Defining “Direction that maximizes δf(x,y)/δinput”

So far we have worked in a two-dimensional space, where y was our output function and x was our input variable. This made it easy to build an understanding of the above concepts. However, to grasp the concept of direction, we need to complicate things by adding more input variables.

Let's now consider a function f(x, y) of two variables x and y. The variables x and y “live” in a two-dimensional input space and together create an output z, defined by their interaction.

Let's say, for example, that x and y interact with each other to create z as some surface. The exact equation does not matter as much as the visual it creates; for the sketches below, we will use the simple bowl z = x² + y² as a stand-in.

The concepts that we have covered before continue to apply. The input space is now governed by the coordinates x and y, and the output space is governed by all possible values of z.

We can calculate both δz/δx and δz/δy. Each is the change in the output space per unit of movement along one direction within the input space. The minor difference is that instead of just one input direction, we now have two (x and y). For this reason, we can calculate changes in the output space relative to a single input direction at a time.

If you have taken calculus, we call δz/δx and δz/δy “partial derivatives” because we are taking slopes in the output space relative to only one of the input directions at a time.
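
Here is what those two partial slopes look like numerically, using the z = x² + y² stand-in from above and nudging one input at a time (the helper names dz_dx and dz_dy are mine, just for illustration):

```python
def f(x, y):
    return x ** 2 + y ** 2  # the bowl-shaped stand-in surface

h = 1e-6  # a tiny movement along a single input direction

def dz_dx(x, y):
    return (f(x + h, y) - f(x, y)) / h  # move only along x, hold y fixed

def dz_dy(x, y):
    return (f(x, y + h) - f(x, y)) / h  # move only along y, hold x fixed

print(dz_dx(1.0, 2.0), dz_dy(1.0, 2.0))  # ≈ 2.0 and ≈ 4.0 (exactly 2x and 2y)
```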

However, we can also calculate the following: δz/(δy+δx).

This is the change in the output space caused by a movement in the input space, where the movement in the input space is not limited to one unique direction (x or y) but is instead a combination of the two.

While technically, starting at a point (x, y), you could move at any angle out of 360 degrees, one direction will maximize the amount of movement you get in z (assuming you are not in a flat region, where all directions produce an equal change in z). We call this direction the gradient.
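
We can check this claim numerically: from a starting point, take a tiny step of fixed length in each of 360 directions and record which one moves z the most. Under the z = x² + y² assumption, the winner lines up with the gradient direction:

```python
import math

def f(x, y):
    return x ** 2 + y ** 2  # the bowl-shaped stand-in surface

x0, y0, step = 1.0, 2.0, 1e-3  # starting point and fixed step length

best_deg, best_dz = None, float("-inf")
for deg in range(360):  # try every whole-degree direction in the input plane
    theta = math.radians(deg)
    dz = f(x0 + step * math.cos(theta), y0 + step * math.sin(theta)) - f(x0, y0)
    if dz > best_dz:
        best_deg, best_dz = deg, dz

# The gradient at (1, 2) is (2x, 2y) = (2, 4), pointing at atan2(4, 2) ≈ 63.4 degrees
print(best_deg)  # 63 -- the direction of steepest climb matches the gradient
```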

Going back to our previous definition:

We can now talk about the direction as the combination of δy and δx that causes δz to change the fastest. In machine learning, we are usually interested in the opposite direction: the one that decreases z the fastest, which is simply the negative of the gradient.
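
To finally tie this back to descent: repeatedly stepping against the gradient walks us downhill in z. A minimal sketch, again assuming z = x² + y² and a hand-picked learning rate:

```python
def grad(x, y):
    return 2 * x, 2 * y  # gradient of z = x**2 + y**2

x, y = 1.0, 2.0  # start somewhere on the bowl
lr = 0.1         # learning rate: how far to move per step (an arbitrary choice)

for _ in range(50):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy  # step AGAINST the gradient

print(round(x, 4), round(y, 4))  # 0.0 0.0 -- we descended to the bottom of the bowl
```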

Now, calculating the gradient is a computationally heavy exercise, but visualizing it is not. Just as we color-coded the slope over the 2D visuals, we can color-code the slope over the 3D visual. Instead of projecting this color onto the “input space”, the image below lets it rest on the “output space z” (i.e., the function z itself).

You should now be able to revisit the video posted at the beginning of this blog and understand its key concepts. If you are successful, you now understand the mathematics behind gradient descent!
