Loss Functions and Optimization

We have an understanding of the linear classifier, but we don't really know how to get the correct $$W$$.

To do:

  1. Define a loss function that quantifies our unhappiness with the scores across the training data
  2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization), i.e. searching through the space of possible $$W$$

Loss function

Given a dataset of examples

$$\{(x_i, y_i)\}_{i=1}^N$$

where $$x_i$$ is the $$i$$th image and $$y_i$$ is the corresponding label. For CIFAR-10, each $$x$$ is an image and each $$y$$ an integer between 0 and 9 denoting the class. There are $$N$$ examples in the training set.

Loss over the dataset is a sum of loss over the examples:

$$L = \frac{1}{N}\displaystyle\sum_{i}L_i(f(x_i, W), y_i)$$

We have some prediction function $$f$$ that takes in an image and our weight matrix $$W$$. $$L_i$$ takes the predicted scores coming out of $$f$$ and the true label $$y_i$$, and gives a quantitative value of how bad the predicted scores are. The final loss $$L$$ is the average ($$\frac{1}{N}\displaystyle\sum_{i}$$) of the losses over the dataset.

This general setup is the same across many different learning models.

Multiclass SVM (support vector machine) loss

For each incorrect class, we penalize how much its score exceeds the correct class's score (plus a margin). If the correct class score is higher than every incorrect class score by at least the safety margin of $$1$$, the loss is 0.

$$L_i = \displaystyle\sum_{j \neq y_i}\text{max}(0, s_j - s_{y_i} + 1)$$

where $$s = f(x_i, W)$$ and $$1$$ is an arbitrary safety margin. This function (taking the max of 0 and something) is also known as a hinge loss because of the shape of its graph.

The equation above could also be written in if/then (case-based) notation.
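Written case by case, that is:

$$L_i = \displaystyle\sum_{j \neq y_i}\begin{cases}0 & \text{if } s_{y_i} \geq s_j + 1\\ s_j - s_{y_i} + 1 & \text{otherwise}\end{cases}$$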

In English, this loss function says that we are happy when the true class score is a lot higher than all of the other class scores.

At initialization $$W$$ is small, so all $$s \approx 0$$. Each of the $$\text{number of classes} - 1$$ incorrect classes then contributes $$\text{max}(0, 0 - 0 + 1) = 1$$, so we should expect a loss of $$\text{number of classes} - 1$$. This is a useful sanity check when training starts.

If we summed over all classes (instead of only $$j \neq y_i$$) the loss would increase by 1. If you did this in practice you would still find the same classifier; it's just nicer to aim for a loss of 0. You can also square the hinge function to get a different loss function. With a squared hinge, very bad results become very, very bad (the penalty grows quadratically rather than linearly). Choosing a linear versus a squared hinge depends on how much we care about different kinds of errors.

Numpy code to compute $$L_i$$:

import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # zero out the correct-class term (easier than skipping it in a loop)
    loss_i = np.sum(margins)
    return loss_i
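As a quick sanity check of the expectation above (a hypothetical example; the shapes match CIFAR-10):

x = np.random.randn(3072)   # a CIFAR-10-sized input vector
y = 0                       # arbitrary "correct" class
W = np.zeros((10, 3072))    # small (here: zero) initialization, so all scores are 0
print(L_i_vectorized(x, y, W))  # 9.0, i.e. number of classes - 1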

A $$W$$ that gives $$L=0$$ is not unique: $$2W$$ also gives $$L=0$$, since doubling $$W$$ doubles all the score differences, and margins that were already met stay met.

This is problematic. If this loss function is supposed to find the right $$W$$, how is it supposed to choose between these different versions of $$W$$ that all give $$L=0$$? The issue is that the loss we've written down is purely a data loss: it only asks that model predictions match the training data. But in the real world we don't really care how well the model fits the training data; we care how it performs on unseen data. Fitting the training data too closely is a source of overfitting, which we address through regularization. We normally add a regularization term to the loss function that encourages a "simple" $$W$$, where "simple" depends on the model.

$$L = \frac{1}{N}\displaystyle\sum_{i}L_i(f(x_i, W), y_i) + \lambda R(W)$$

This is Occam's Razor: among competing hypotheses, the simplest is the best.

So the above loss function now has a data loss and a regularization loss, with a hyperparameter $$\lambda$$ which trades off between the two. This is a kind of soft constraint on the model.

There are many different types of regularization: L2 regularization $$R(W) = \sum_k\sum_l W^2_{k,l}$$ (the most common), L1 regularization $$R(W) = \sum_k\sum_l |W_{k,l}|$$, elastic net (a combination of L1 and L2), dropout, and others.

These all do the same thing: encourage the model to be simple.
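As a minimal sketch of the full loss with L2 regularization, reusing L_i_vectorized from above (the data layout, with one example per column, and the name lam are assumptions of this sketch):

def full_loss(X, Y, W, lam):
    # X: D x N array (one example per column), Y: N integer labels
    N = X.shape[1]
    data_loss = sum(L_i_vectorized(X[:, i], Y[i], W) for i in range(N)) / N
    reg_loss = lam * np.sum(W * W)  # L2 regularization: sum of squared weights
    return data_loss + reg_loss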

Softmax Classifier Loss (Multinomial Logistic Regression). Cross-entropy loss!

Recall for the Multiclass SVM we didn't really give meaning to the scores.

Here the scores are unnormalized log probabilities of the classes.

$$P(Y=k|X=x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

where $$s = f(x_i, W)$$

For a loss function, we want to minimize the negative log likelihood of the correct class. If we know the image is a cat, the target probability would be 1 for cat and 0 for everything else.

$$L_i = -\log P(Y=y_i|X=x_i)$$

We want to encourage the probability distribution coming out of the softmax to be the same as the target distribution, i.e. we want the true class probability to be close to 1. In practice we:

  1. Take the scores coming out of the classifier (unnormalized log probabilities)
  2. Exponentiate them (so they're all positive)
  3. Normalize (so they sum to 1)

The loss is then the minus log of the correct class probability.
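A minimal numpy sketch of these steps (reusing the np import from above), with the standard trick of subtracting the max score before exponentiating so the exponentials don't overflow; the shift doesn't change the probabilities:

def softmax_loss_i(x, y, W):
    scores = W.dot(x)                   # unnormalized log probabilities
    scores -= np.max(scores)            # shift for numerical stability
    probs = np.exp(scores) / np.sum(np.exp(scores))  # exponentiate, then normalize
    return -np.log(probs[y])            # minus log probability of the correct class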

To get a loss of 0, you would need an infinite score for the correct class and minus infinity for everything else, so in practice the loss never quite reaches its minimum of 0; the maximum is $$\infty$$. The SVM loss stops caring once the datapoint is over the safety margin, while softmax will keep pushing the correct class probability toward 1 forever. In practice, the difference isn't that big.

Recap

This is a pretty generic view of supervised learning.

Now: Optimization

Gradient descent!

Imagine you're walking around a valley: the bottom of the valley is the minimum of the loss function. With complex models, there's really no hope of finding an analytical solution to this.

Maybe the stupidest thing you could think of is random search: Just try different values for $$W$$ and see how they do.

In practice, it's better to use the local geometry of the valley (ie. the local gradient/slope).

In a one-dimensional function, the slope is the derivative of the function:

$$\frac{df(x)}{dx} = \displaystyle\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$

You calculate the slope for a section of the curve, then let the length of that section go to 0 and you get the local slope.

We need to generalize to multiple dimensions: in multiple dimensions, the gradient is the vector of partial derivatives along each dimension. Each element of the gradient vector tells us the slope of the function if we move in that dimension.
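Written out for a function $$f(x_1, \dots, x_n)$$, the gradient is:

$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$$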

A lot of deep learning is about computing gradients and using them. We could imagine computing the gradient using finite differences: change each element of $$W$$ by a small amount, recompute the loss, measure the difference, and approximate the gradient that way.

This is terrible because it is slow: a deep learning model might have hundreds of millions of parameters. It's better to use calculus, because the gradient is itself a function of $$W$$ that we can write down analytically.

In practice, you use the analytic gradient, but you can use a numerical gradient to debug. This is called a gradient check.
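A sketch of such a check using centered finite differences (the step h and the element-wise loop are standard choices; this is an illustration, not a production implementation):

def numerical_gradient(loss_function, W, h=1e-5):
    # Approximate dL/dW one element at a time with centered differences.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        loss_plus = loss_function(W)
        W[idx] = old - h
        loss_minus = loss_function(W)
        W[idx] = old  # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad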

Once we know how to compute the gradient, we get to gradient descent, which is just beautiful:

while True:
    weights_gradient = evaluate_gradient(loss_function, data, weights)
    weights += - step_size * weights_gradient  # step in the direction of minus gradient

Recall the gradient points in the direction of greatest increase, so we use the minus gradient. We take small steps in the direction of minus gradient until the network converges.

The step size (or learning rate) is one of the most important hyperparameters to set when training these models: if it's too small we could take forever to converge, and if it's too big we could step right past the minimum.

The above example is very basic: we just take a small step in the direction of minus gradient. There are slightly fancier update rules that work better in practice, for example SGD with momentum or Adam, sketched below.
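A sketch of the momentum update in the same pseudocode style as above (mu is the momentum coefficient; 0.9 is a typical but assumed value):

velocity = np.zeros_like(weights)
mu = 0.9  # momentum coefficient
while True:
    weights_gradient = evaluate_gradient(loss_function, data, weights)
    velocity = mu * velocity - step_size * weights_gradient  # keep a running direction
    weights += velocity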

But it's still the same basic algorithm.

Stochastic gradient descent

In practice, $$N$$ can be very, very large, making full-batch updates slow. Hence SGD, where we sample a random subset (a mini-batch) of the training set at each training step (32, 64, or 128 examples are common). In code:

while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_gradient = evaluate_gradient(loss_function, data_batch, weights)
    weights += - step_size * weights_gradient  # parameter update

[Web demo]

Aside: Image Features

In practice, feeding raw pixel values into a linear classifier doesn't work so well. Before neural networks, you would compute some feature representation of the image and feed that into the linear classifier. The idea is that datasets which aren't linearly separable in pixel space might become linearly separable under some smart feature representation.

Example feature representations include color histograms, histograms of oriented gradients (HoG), and bag-of-words features.
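For instance, a color histogram feature could be computed like this (a hypothetical minimal sketch; the bin count is arbitrary):

def color_histogram(image, bins=8):
    # image: H x W x 3 array of uint8 pixel values.
    # One histogram per color channel, concatenated into a single feature vector.
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists) / (image.shape[0] * image.shape[1])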

Today, this whole long feature extraction pipeline is replaced by Convolutional networks. Instead of writing down the features before training, we will let the network learn them from the data.

Next: Neural networks, backpropagation.