Supervised Learning II – Linear Regression

Linear regression is one of the simplest supervised learning algorithms. Here, the goal is not to predict the class, or category, of a given example but to predict a real numerical value for it.

It can be represented formally by \(Y=W^{T}X\), where \(Y=[y_1, y_2, …, y_m]\) is the output vector, \(W=[w_1, w_2, …, w_n]^{T}\) is the weight vector and \(X=[x_1, x_2, …, x_m]\), with \(x_c=[x_{c1}, x_{c2}, …, x_{cn}]^{T}\), is the data matrix. In this notation, Y, W and X have sizes (1×m), (n×1) and (n×m), respectively, where m is the number of examples and n is the number of features. If you carry out the matrix multiplication above, you see that each entry of Y results from a weighted sum of the features of the corresponding example. The figure below gives a representation of the algorithm.

[Figure: linear regression represented as a weighted sum of each example's features]
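As a quick illustration, the prediction step can be written in a few lines of NumPy. This is only a minimal sketch following the notation above; the function name and the toy shapes are mine, not taken from the original implementation:

```python
import numpy as np

# X holds one example per column (shape n x m); W is the weight vector (length n).
def predict(W, X):
    """Return Y = W^T X: one predicted value per example."""
    return W @ X  # each entry is the weighted sum of one example's features

# Toy example: n = 3 features, m = 4 examples
W = np.array([0.5, -1.0, 2.0])   # shape (3,)
X = np.random.rand(3, 4)         # shape (3, 4)
Y_hat = predict(W, X)            # shape (4,)
```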

So far so good. Next, we should determine a cost function that we can minimize to train the algorithm. In general, the sum of squared errors (SSE) is used as the cost function in linear regression:

\(J(W)=\frac{1}{2}\sum_{i}(h_{W}(x^{(i)})-y^{(i)})^{2}\) where

\(h_{W}(x^{(i)})=W^{T}x^{(i)}\).

The term inside the parentheses simply gives the difference between the predicted value and the actual value of an example. To obtain the overall error on the whole dataset, we sum up the errors made on each example. However, before summing up the individual errors, we square them to get rid of the effect of the sign; otherwise, negative and positive errors would cancel each other out. Taking half of the overall error has no real meaning; it just makes the differentiation easier.
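In code, the cost is just a couple of lines. The sketch below assumes the same column-wise data layout as before and is only illustrative:

```python
import numpy as np

def sse_cost(W, X, Y):
    """J(W) = 1/2 * sum_i (W^T x_i - y_i)^2 over all m examples."""
    residuals = W @ X - Y               # difference between prediction and target
    return 0.5 * np.sum(residuals ** 2)
```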

As mentioned in the introductory post of the series, we should now take the partial derivative of this cost function with respect to each weight. We can use the chain rule here.

Forget the summation operator for now; we can bring it back later.

If \(W^{T}x^{(i)}=z\) then \(J(W)=\frac{1}{2}(z-y^{(i)})^{2}\) and \(\frac{\partial J}{\partial W}=\frac{\partial J}{\partial z}\frac{\partial z}{\partial W}\).

\(\frac{\partial J}{\partial z}=(z-y^{(i)})\) and \(\frac{\partial z}{\partial W}=x^{(i)}\).

Thus, \(\frac{\partial J}{\partial W}=(z-y^{(i)})x^{(i)}=(W^{T}x^{(i)}-y^{(i)})x^{(i)}\). Now, we can plug in the summation operator:

\(\frac{\partial J}{\partial W}=\sum_{i}(W^{T}x^{(i)}-y^{(i)})x^{(i)}\) which can be re-written as

\(\frac{\partial J}{\partial W}=X(W^{T}X-Y)^{T}\).

Each entry of the resulting vector \(\frac{\partial J}{\partial W}\) is the partial derivative of the cost function with respect to the corresponding weight.
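The vectorized form of the gradient maps directly onto NumPy. Again, this is a sketch under the column-wise layout assumed above, not the post's original code:

```python
import numpy as np

def gradient(W, X, Y):
    """dJ/dW = X (W^T X - Y)^T, i.e. one partial derivative per weight."""
    residuals = W @ X - Y   # shape (m,): error on each example
    return X @ residuals    # shape (n,): errors weighted by the features and summed
```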

Now that we’ve calculated the partial derivatives, we can run the gradient descent algorithm to update the weights. I implemented the algorithm in Python and tested it on the concrete compressive strength dataset using 10-fold cross-validation. The figure below shows the actual and the predicted strength of the concrete on the y-axis for one of the folds; the x-axis is the sample ID.

As can be seen, the algorithm is not very successful at predicting the exact strength values. However, it mostly captures the trend: if the actual strength of a sample is large relative to the other samples, the algorithm’s prediction is also large.

There are a couple of things worth mentioning here. First, the algorithm is quite sensitive to the learning rate, as expected. Second, when do we stop gradient descent? We didn’t address this question in the previous post. Two approaches are commonly adopted: we stop training either when the number of epochs reaches a pre-determined threshold, say 10000, or when the difference between the costs produced in two successive epochs is smaller than some pre-determined threshold, say 0.0001. An epoch is one step in which we compute the partial derivatives using all of the training examples.
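Putting the pieces together, a batch gradient descent loop with both stopping criteria might look like the sketch below. The default hyperparameter values and the function name are illustrative, not the exact settings used in the experiment:

```python
import numpy as np

def gradient_descent(X, Y, learning_rate=1e-4, max_epochs=10000, tol=1e-4):
    """Batch gradient descent for linear regression (X: n x m, Y: length m)."""
    W = np.zeros(X.shape[0])                  # start from all-zero weights
    prev_cost = np.inf
    for epoch in range(max_epochs):           # first stopping criterion: epoch limit
        residuals = W @ X - Y                 # errors on the whole training set
        cost = 0.5 * np.sum(residuals ** 2)   # J(W)
        if abs(prev_cost - cost) < tol:       # second criterion: cost barely changes
            break
        prev_cost = cost
        W -= learning_rate * (X @ residuals)  # update all weights using dJ/dW
    return W
```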

Third, how does the linear regression algorithm respond to standardization (z-scoring)? It turns out that standardization doesn’t affect the prediction performance considerably in linear regression. The prediction performance can be measured using the mean squared error (MSE). After adjusting the learning rate separately for the standardized and non-standardized data, we see that the algorithm gives an average MSE (averaged over the folds) of around 110 in both cases.
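The comparison can be reproduced with something along the lines of the sketch below. It reuses the gradient_descent sketch above together with scikit-learn’s KFold; the z-scoring uses only the training fold’s statistics, and none of this is the post’s original evaluation code:

```python
import numpy as np
from sklearn.model_selection import KFold

def zscore(X_train, X_test):
    """Standardize features using the training fold's mean and std only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0                          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def cv_mse(X, Y, n_splits=10, standardize=True, **gd_kwargs):
    """Average test MSE over k folds; X holds one example per column."""
    mse_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(X.T):
        X_tr, X_te = X[:, train_idx], X[:, test_idx]
        if standardize:
            X_tr, X_te = zscore(X_tr, X_te)
        W = gradient_descent(X_tr, Y[train_idx], **gd_kwargs)
        mse_scores.append(np.mean((W @ X_te - Y[test_idx]) ** 2))
    return np.mean(mse_scores)
```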

