We have an understanding of the linear classifier, but we don’t really know how to get the correct $$W$$.

To do:

  1. Define a loss function that quantifies our unhappiness with the scores across the training data
  2. Come up with parameters that efficiently minimize the loss function (optimization), i.e. searching through the space of possible $$W$$

Loss function

Given a dataset of examples

$$\{(x_i,y_i)\}^N_{i=1}$$

where $$x_i$$ is the $$i$$th image and $$y_i$$ is the corresponding label. For CIFAR-10, $$x$$ will be images and $$y$$ will be an integer between 0 and 9 denoting the class. There are $$N$$ examples in the training set.

Loss over the dataset is the average of the losses over the examples:

$$L = \frac{1}{N}\displaystyle\sum_{i}L_i(f(x_i, W), y_i)$$

We have some prediction function $$f$$ that takes in an image and our weight matrix $$W$$. $$L_i$$ takes the predicted scores coming out of $$f$$ and the actual true label $$y_i$$ and gives some quantitative value of how bad the predicted scores are. The final loss $$L$$ is the average ($$\frac{1}{N}\displaystyle\sum_{i}$$) of all the losses over the dataset.
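
As a minimal sketch of that computation (assuming $$f$$ and the per-example loss are given as Python functions, and a dataset of (x, y) pairs):

L = sum(L_i(f(x, W), y) for x, y in dataset) / len(dataset)  # average per-example loss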

This general setup is the same for a lot of different learning models.

Multiclass SVM (support vector machine) loss

We sum up the difference between the score for each incorrect class and the score for the correct class. If the correct class score beats every incorrect class score by at least the arbitrary safety margin of $$1$$, the loss is 0.

$$L_i = \displaystyle\sum_{j \neq y_i}\text{max}(0, s_j - s_{y_i} + 1)$$,

where $$s = f(x_i, W)$$ and $$1$$ is an arbitrary safety margin. This function (where we take the max of 0 and something) is also known as a hinge loss because of the shape of the graph when you plot the loss.

The equation above could also be written in if/then (case-based) notation:
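
$$L_i = \displaystyle\sum_{j \neq y_i}\begin{cases} 0 & \text{if } s_{y_i} \geq s_j + 1 \\ s_j - s_{y_i} + 1 & \text{otherwise} \end{cases}$$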

In English, this loss function says that we are happy when the true class score is a lot higher than all other classes.

At initialization $$W$$ is small, so all $$s \approx 0$$. We should expect a loss of $$\text{number of classes} - 1$$, since each incorrect class contributes $$\text{max}(0, 0 - 0 + 1) = 1$$.

If we went over all classes (instead of only $$j \neq y_i$$) the loss increases by 1. If you did this in practice you would still find the same classifier; it’s just nicer to aim for a loss of 0. You can also square the hinge function to get a different loss function. If you’re using a squared loss, very bad results will be very, very bad (because the penalty now grows quadratically). Using a linear vs a squared penalty depends on how much we care about different kinds of errors.

Numpy code to compute $$L_i$$:

import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # zero out the margin for the correct class (easier than looping over j != y)
    loss_i = np.sum(margins)
    return loss_i
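
A quick sanity check of the claim above (a sketch; the CIFAR-10-like shapes and the correct class of 3 are hypothetical):

W = 0.0001 * np.random.randn(10, 3073)  # small random weights, 10 classes
x = np.random.randn(3073)               # one hypothetical flattened image
print(L_i_vectorized(x, 3, W))          # approximately 9 = number of classes - 1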

$$W$$ that gives $$L=0$$ is not unique. $$2W$$ also gives $$L=0$$.

This is problematic. If this loss function is supposed to find the right $$W$$, how is it supposed to choose between these different versions of $$W$$ that all give $$L=0$$? The issue is that the loss we’ve written down is purely data loss: it only asks that model predictions match the training data. In the real world, we don’t really care how well the model matches the training data; we care how it does on unseen data. Fitting the training data too closely is a source of overfitting, which we solve through regularization. We normally add a regularization term to the loss function that encourages a “simple” $$W$$, where “simple” depends on the model.

$$L = \frac{1}{N}\displaystyle\sum_{i}L_i(f(x_i, W), y_i) + \lambda R(W)$$

This is Occam’s Razor: among competing hypotheses, the simplest is the best.

So the above loss function now has a data loss and a regularization loss, with a hyperparameter $$\lambda$$ which trades off between the two. This is a kind of soft constraint on the model.
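
A minimal sketch of the full loss with an L2 penalty as $$R(W)$$ (the vectorized data-loss computation, the helper name, and the shapes here are assumptions to make the example concrete):

import numpy as np

def full_loss(W, X, y, lam):
    # Data loss: average multiclass SVM loss over the N examples.
    # X is N x D (one row per example), y holds the N correct class indices.
    scores = X.dot(W.T)                              # N x C class scores
    correct = scores[np.arange(len(y)), y][:, None]  # correct-class score per row
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0                # don't count the correct class
    data_loss = margins.sum() / len(y)
    # Regularization loss: L2 penalty, encouraging small ("simple") weights.
    reg_loss = np.sum(W * W)
    return data_loss + lam * reg_loss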

There are a lot of different types of regularization: L2 regularization ($$R(W) = \sum_k\sum_l W^2_{k,l}$$), L1 regularization ($$R(W) = \sum_k\sum_l |W_{k,l}|$$), elastic net (L1 + L2), and fancier ones like dropout and batch normalization.

These all do the same thing: encourage the model to be simple.

Softmax Classifier Loss (Multinomial Logistic Regression). Cross-entropy loss!

Recall for the Multiclass SVM we didn’t really give meaning to the scores.

Here the scores are unnormalized log probabilities of the classes.

$$P(Y=k|X=x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

where $$s = f(x_i,W)$$

For a loss function, we want to minimize the negative log likelihood of the correct class. If we know the image is a cat, the target probability would be 1 for cat and 0 for everything else.

$$ L_i = -\text{log} P(Y=y_i|X=x_i) $$

We want to encourage the probability distribution coming out of the softmax to be the same as the target. We want the true class probability to be close to 1. In practice we:

  1. Take the scores out of the classifier (unnormalized log probabilities)
  2. Exponentiate them (so they’re all positive)
  3. Normalize (so they add up to 1)

The loss is the minus log of the correct class probability.
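
Those three steps in numpy (a sketch; the max-subtraction is a standard numerical-stability trick, not something covered above, and it doesn’t change the probabilities):

import numpy as np

def softmax_loss_i(x, y, W):
    scores = W.dot(x)                  # step 1: unnormalized log probabilities
    scores = scores - np.max(scores)   # stability shift (assumed, not from the lecture)
    probs = np.exp(scores) / np.sum(np.exp(scores))  # steps 2 and 3: exponentiate, normalize
    return -np.log(probs[y])           # minus log probability of the correct class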

To get a 0 loss, you would have to have an infinite score for the correct class and minus infinity for everything else. The minimum loss is 0, and the maximum is $$\infty$$. SVM will get the datapoint over the safety margin and then stop caring, while Softmax will keep pushing forever. In practice, the difference isn’t that big.

Recap

This is a pretty generic view of supervised learning.

Now: Optimization

Gradient descent!

Imagine you’re walking around a valley, where the bottom of the valley is the minimum of the loss function. With complex models, there’s really no hope of finding an analytical solution to this.

Maybe the stupidest thing you could think of is random search: just try different values for $$W$$ and see how they do.
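
A sketch of random search (loss here is a hypothetical function computing the full training loss, and the shapes are CIFAR-10-like):

import numpy as np

best_loss, best_W = float('inf'), None
for _ in range(1000):
    W = 0.001 * np.random.randn(10, 3073)    # a random candidate
    current_loss = loss(W)                   # hypothetical: full training loss for W
    if current_loss < best_loss:
        best_loss, best_W = current_loss, W  # keep the best W seen so far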

In practice, it’s better to use the local geometry of the valley (i.e. the local gradient/slope).

In a one-dimensional function, the slope is the derivative of the function:

$$\frac{df(x)}{dx} = \displaystyle\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$

You calculate the slope for a section of the curve, then let the length of that section go to 0 and you get the local slope.

We need to generalise to multiple dimensions: in multiple dimensions, the gradient is the vector of partial derivatives along each dimension. Each element in the gradient vector tells us the slope of the function if we go in that dimension.

A lot of deep learning is about computing gradients and using them. We could imagine computing the gradient using finite differences: change each element in $$W$$ by a small amount, recompute the loss, measure the difference, and approximate the gradient that way.
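
A minimal finite-differences sketch (the centered-difference formula and the step size h are standard choices; loss is assumed to be a function of W):

import numpy as np

def numerical_gradient(loss, W, h=1e-5):
    # Nudge each element of W up and down by h, re-evaluate the loss,
    # and approximate the partial derivative with a centered difference.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_value = W[idx]
        W[idx] = old_value + h
        loss_plus = loss(W)
        W[idx] = old_value - h
        loss_minus = loss(W)
        W[idx] = old_value  # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad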

This is terrible because it is slow: a deep learning model might have hundreds of millions of parameters. It’s better to use calculus, because the loss is a function of $$W$$ and we can write down the gradient analytically.

In practice, you use an analytical gradient, but you can use a numerical gradient to debug. This is called a gradient check.
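
To use the sketch above as a gradient check, compare it against your analytic gradient (analytic_grad here is a hypothetical stand-in for whatever expression you derived with calculus):

num_grad = numerical_gradient(loss, W)
ana_grad = analytic_grad(W)  # hypothetical: the calculus-derived gradient
# Relative error is the usual comparison; values very close to 0 mean the two agree.
rel_error = np.abs(num_grad - ana_grad) / np.maximum(np.abs(num_grad) + np.abs(ana_grad), 1e-8)
print(rel_error.max())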

Once we know how to compute the gradient, we get to gradient descent, which is just beautiful:

while True:
    weights_gradient = evaluate_gradient(loss_function, data, weights)
    weights += - step_size * weights_gradient  # step in the direction of minus gradient

Recall the gradient points in the direction of greatest increase, so we use the minus gradient. We take small steps in the direction of minus gradient until the network converges.

The step size (or learning rate) is one of the most important hyperparameters to set when you train these models: if it’s too small we could take forever to converge, and if it’s too big we could step past the minimum.

The above example is very basic - we just make a small step in the direction of minus gradient. There are slightly fancier step rules that work better in practice: SGD with momentum, Adagrad, RMSProp, Adam, and others.

But it’s still the same basic algorithm.

Stochastic gradient descent

In practice, $$N$$ could be very, very large, making the full loss slow to compute. Thus SGD, where we sample a random subset (mini-batch) of the training set at each training step (32, 64, or 128 instances are common). In code:

while True:
    data_batch = sample_training_data(data, 256)  # get 256 examples
    weights_gradient = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_gradient  # parameter update

[Web demo]

Aside: Image Features

In practice, feeding raw pixel values into a linear classifier doesn’t work so well. Before neural networks, you would compute some feature representation of the image and feed that into the linear classifier. The idea here is that datasets which aren’t linearly separable in pixel space might become linearly separable through some smart feature representation.

Example feature representations: color histograms, histograms of oriented gradients (HoG), and bag of words.
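
For instance, a color histogram (a minimal sketch of one classic hand-designed feature, assuming an H x W x 3 uint8 RGB image array):

import numpy as np

def color_histogram(image, bins=8):
    # Count pixel intensities per channel and concatenate into one feature vector.
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    return np.concatenate(feats)  # 3 * bins numbers to feed into the linear classifier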

Today, this whole long feature extraction pipeline is replaced by Convolutional networks. Instead of writing down the features before training, we will let the network learn them from the data.

Next: Neural networks, backpropagation.