The task of supervised learning is to find a function that predicts the category of a given data point *well*. Here, the word *well* has a relative meaning: what we actually do is try to find a function that matches data to its class labels, or to target values in regression, as well as it can. But how can we define what *well* actually means? For that, we need a scoring function that tells us how *well* we are doing. This scoring function is called the **cost function**. By minimizing this function (or maximizing it, depending on what the function is), supervised learning algorithms try to find a matching that yields as few errors as possible.

Many supervised learning algorithms, like linear regression, logistic regression, neural networks, etc., use differentiation to minimize the cost function. Remember how we find the minimum or maximum of a function: we set its first derivative to 0 and solve for the unknowns. So, we should find the derivative of the cost function. But with respect to what? To answer this question we need to visualize the cost function in **weight space**.

Not surprisingly, the mentioned algorithms perform learning by adjusting their parameters so that, at the end of this adjustment process, the algorithm makes as few errors as it can. The above graph shows a snapshot of this *weight adjusting*, or *learning*, process in the *weight space*. The axes represent the weights of the model and the total cost, so that a point in this space, shown by the blue ball, corresponds to a choice of weights and the cost resulting from that choice. Eventually, the goal is to take the ball to a pit, which represents a low-error region.

One way of doing this is taking the partial derivative of the cost function with respect to each weight, setting these derivatives to 0 and solving for the unknowns. This may work when the cost function is linear in the weights. Still, if the number of weights is large, it would be infeasible to do so. Then there are models, like neural networks, whose cost function is not linear in the weights, and there may not be a solution to the system obtained by setting the first derivatives to 0.
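
For a cost that is linear in the weights, the analytical route can be made concrete with ordinary least squares: setting the partial derivatives of the squared error to zero yields the so-called normal equations. Below is a minimal sketch; the toy data is made up for illustration and is not from the text.

```python
import numpy as np

# Toy data generated from y = 1 + 2x; the first column is a bias of ones.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Setting the partial derivatives of the squared error to zero gives
# the normal equations: (X^T X) w = X^T y. Solving them directly:
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # intercept close to 1, slope close to 2
```

This works here because the cost is quadratic in the weights, so the system of equations is linear; with a non-linear model there is, in general, no such closed-form solution.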

There is another way of taking the ball to a pit, other than the analytical solution. By taking the partial derivative of the cost function with respect to a weight, we can tell whether the cost increases or decreases as that weight increases, and move the ball in the direction where the cost gets smaller. A differentiable function increases in the direction of its derivative and decreases in the opposite direction. Thus, we should update each weight in the negative direction of its partial derivative, and do this for all of the weights of the model. By now it should be clear that moving the ball corresponds to updating the values of the weights.
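
In code, this update amounts to subtracting a fraction of each partial derivative from the corresponding weight. A minimal sketch, using an illustrative quadratic cost (not a cost function from the text):

```python
import numpy as np

def gradient(w):
    # Gradient of an illustrative cost f(w) = sum((w - 3)^2), whose
    # minimum is at w = 3 for every weight. In practice this would be
    # the gradient of the model's actual cost function.
    return 2 * (w - 3)

alpha = 0.1                   # fraction of the derivative to move by
w = np.array([10.0, -4.0])    # arbitrary starting weights

for _ in range(100):
    # Move every weight against its own partial derivative.
    w = w - alpha * gradient(w)

print(w)  # both weights approach 3, the minimum of this cost
```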

This process of determining the direction and updating the weights accordingly is called **gradient descent**. Gradient descent is an iterative optimization algorithm and, in this context, its purpose is to move the ball to a *minimum*, be it global or local, hence the name. To show how gradient descent works, we can run it on a simple function like a parabola. The figure below shows the plot of the parabola whose equation is \(f(x)=x^2-3x-25\).

This function has only one minimum, so it's a global one. If it had multiple minima, the smallest one would be the global minimum and the others would be local minima.

This figure can be obtained using the Python code below.

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot f(x) = x^2 - 3x - 25
X = np.linspace(-40, 40, num=810)
Y = np.power(X, 2) - 3 * X - 25

plt.plot(X, Y)
plt.xlim(-30, 30)
plt.ylim(-40, 100)
plt.grid()
plt.axhline(0, color="black")
plt.axvline(0, color="black")
plt.show()
```

As mentioned before, our goal is to find the minimum of this function. For the sake of simplicity, we don't have weights in this example; thus, we seek the value of x where *f(x)* is smallest. Nevertheless, the same logic applies to cases where we have weights.

In the beginning, we choose an arbitrary value for x, let's say 13. The next step is to compute the derivative of the function at 13, which is \(2 \times 13 - 3 = 23\). Now, we update x according to \(x = 13 - 23 = -10\). If we keep repeating these two steps, we obtain the x values 13, -10, 13, -10, ... As you can see, the ball oscillates between 13 and -10. To prevent this, we should update x by smaller amounts. We can do this by multiplying the derivative by a small number, say 0.05. Now, our gradient descent algorithm becomes \(x^{n+1} = x^n-\alpha f'(x^n)\).
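
The oscillation and its fix can be reproduced in a few lines of Python:

```python
def f_prime(x):
    return 2 * x - 3   # derivative of f(x) = x^2 - 3x - 25

# Without a learning rate the ball oscillates between 13 and -10.
x = 13
for _ in range(4):
    x = x - f_prime(x)
    print(x)           # prints -10, 13, -10, 13

# With alpha = 0.05 the updates shrink and x approaches 1.5,
# the point where the derivative is zero.
alpha = 0.05
x = 13
for _ in range(200):
    x = x - alpha * f_prime(x)
print(round(x, 4))     # very close to 1.5, the minimum
```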

Here, \(\alpha\) is called the *learning rate*. It is one of the hyperparameters that many supervised learning algorithms use, and it must be set manually before running the algorithm.

With this new change, the algorithm finds, or more precisely approximates, the minimum. The animation below shows the position of the ball as the algorithm runs.

This animation can be obtained using the Python code below.

```python
import math

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

alpha = 0.05

# The parabola f(x) = x^2 - 3x - 25
X = np.linspace(-11, 14, num=810)
Y = np.power(X, 2) - 3 * X - 25

# Starting position of the ball
ball_x = 13
ball_y = math.pow(ball_x, 2) - 3 * ball_x - 25

fig = plt.figure()
plt.grid()
plt.axhline(0, color='black')
plt.axvline(0, color='black')
ax = plt.axes()
ax.axis('equal')
plt.plot(X, Y)

patch = plt.Circle((ball_x, ball_y), radius=2, fc='r')

def init():
    patch.center = (ball_x, ball_y)
    ax.add_patch(patch)
    return patch,

def animate(i):
    global ball_x
    global ball_y
    # One gradient descent step: x = x - alpha * f'(x)
    ball_x = ball_x - alpha * (2 * ball_x - 3)
    ball_y = math.pow(ball_x, 2) - 3 * ball_x - 25
    patch.center = (ball_x, ball_y)
    return patch,

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=300, interval=100, blit=True)
# anim.save('grad.mp4', fps=60)
plt.show()
```

If you print the values of the variables *ball_x* and *ball_y* to the screen in the *animate* function, you will see that it takes the ball pretty long to get to the minimum, which is at (1.5, -27.25). We can speed this up by increasing the learning rate to, say, 0.3. If we increase it too much, say to 0.9, the ball follows a zig-zag path. If we increase it even further, say to 1.1, the ball moves away from the minimum.
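
These effects are easy to verify numerically; the snippet below runs the same update rule with each of these learning rates:

```python
def step(x, alpha):
    # One gradient descent update for f(x) = x^2 - 3x - 25
    return x - alpha * (2 * x - 3)

results = {}
for alpha in (0.05, 0.3, 0.9, 1.1):
    x = 13.0
    for _ in range(50):
        x = step(x, alpha)
    results[alpha] = x
    print(alpha, x)
# 0.05 -> still slowly approaching 1.5
# 0.3  -> essentially at the minimum, 1.5
# 0.9  -> reaches the minimum in a zig-zag (each update overshoots and flips sign)
# 1.1  -> diverges away from the minimum
```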

Because the parabola is a pretty simple function to tackle, we can adjust the learning rate without much work. In reality, however, cost functions are multi-dimensional and are not smooth; there are bumps, so even a tiny change in the weights may cause a big difference in the error. The learning rate may be adjusted by trial and error or automatically, depending on the problem at hand and the learning algorithm used. Automatic approaches include AdaDelta, AdaGrad, RMSProp, etc. In these approaches, the learning rate is adjusted dynamically during the learning process.
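
As a taste of these adaptive approaches, here is a minimal AdaGrad-style sketch on the same parabola. Dividing by the root of the accumulated squared gradients shrinks the effective learning rate over time; the starting point and base rate below are arbitrary choices for illustration, not values prescribed by any particular library.

```python
import math

def f_prime(x):
    return 2 * x - 3   # derivative of f(x) = x^2 - 3x - 25

x, alpha = 13.0, 1.0
sum_sq = 0.0           # AdaGrad accumulator: sum of squared gradients so far

for _ in range(500):
    g = f_prime(x)
    sum_sq += g * g
    # Effective learning rate alpha / sqrt(sum_sq) shrinks as
    # gradients accumulate, so early steps are large and later
    # steps are fine-grained.
    x = x - alpha * g / (math.sqrt(sum_sq) + 1e-8)

print(round(x, 3))     # close to the minimum at x = 1.5
```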