Today we’re div­ing into how learn­ing al­go­rithms work in de­tail.

As we know, numpy is good for ef­fi­cient vec­tor maths. [Tutorial]

Image clas­si­fi­ca­tion: A core task in com­puter vi­sion

Your sys­tem re­ceives some in­put im­age. The model also has a fixed set of dis­crete la­bels:

$${cat, dog, plane, car}$$

The task is to as­sign a class to the im­age.

To the com­puter the im­age is­n’t a hol­lis­tic iim­age of a cat - it’s just a huge ar­ray of num­bers be­tween 0 and 255. The dif­fer­ence be­tween the se­man­tic idea of cat” and this huge ar­ray is known as the se­man­tic gap.

You can change the im­age in very sub­tle ways (ie. move the cam­era), then every singl pixel in the ar­ray would be dif­fer­ent, even thugh it;s stil the same cat. Other chal­lenges in­clude:

Our brain is evolved to do all of this, but it’s very hard to do au­to­mat­i­cally. Amazingly, some of this tech works at hu­man-level per­for­mance in som lim­ited prob­lems.

Unlike, say, sort­ing an ar­ray, there’s no ob­vi­ous way to hard-code the al­go­rithm for rec­og­niz­ing the cat.

Attempts have been made though:

  1. Edge de­tec­tion
  2. Categorize the dif­fer­ent cor­ners etc, write down a set of rules to rec­og­nize a cat

This is:

A data-dri­ven ap­proach is much bet­ter.

  1. Collect a dataset of im­ages and la­bels
  2. Use ma­chine learn­ing to train a clas­si­fier (Ingest all the data and sum­marise it in some way)
  3. Evaluate the clas­si­fier on new im­ages

Our API has changed: rather than one func­tion that takes an im­age and re­turns a la­bel, we have two:

Of course the con­cept of a data-dri­ven ap­proach is much more gen­eral

First clas­si­fier: Nearest Neighbour

train(): mem­o­rize all data and la­bels predict(): Predict the la­bel based on the most sim­i­lar source im­age

Let’s do this on CIFAR10

Given a pair of im­ages, how do we ac­tu­ally com­pare them?

L1 dis­tance: $$d_1(I_1, I_2) = \displaystyle\sum_{p}|I_1^p-I_2^p|$$

This just com­pares all pix­els in each im­age and sub­stracts them, fi­nally adds up the dif­fer­ences to pro­duce a scalar. It’s kind of a stu­pid way to com­pare im­ages, but it does rea­son­able things some­times.

[python code to do near­est neigh­bour clas­si­fier]

import numpy as np

class NearestNeighbour:
    def __init__(self):

    def train(self, X, y):
    # X is N x D where each row is an example. Y is a one-dimensional vector of size N. N of course the number of examples
    # Below, we're just memorising all the training data and all the examples
    self.Xtr = X
    self.ytr = y

    def predict(self, X):
    # X is N x D where each row is an example we want to predict a label for
    num_test = X.shape[0]
    # Let's make sure that teh output type matches th input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    for i in xrange(num_test):
        #L1 distance:
        distances = np.sum(np.abs(self.Xtr - X[i,:]), axis=1) # this returns a vector with the distance to each training image
        min_index = np.argmin(distances)
        Ypred[i] = self.ytr[min_index]
    return Ypred

Computation: train O(1) predict O(N)

This is bad: We want clas­si­fiers that are slow at train­ing time, but fast at test­ing time - this is the wrong way around. Neural net­works are the re­verse of this.

We can draw the de­ci­sion bound­aries of the near­est neigh­bour clas­si­fier. By look­ing at the pic­turewe start to see some of the prob­lems:

This mo­ti­vates k-near­est negh­bours. Instead of just find­ing the near­est neigh­bour, we find k near­est neigh­bours, and take a ma­jor­ity vote as to whihch class a dat­a­point should be­long to. This leads to smoother de­ci­sion bound­aries. You al­most al­ways want K > 1.

When think­ing about com­puter vi­sion, it’s good to flip back and forth be­tween dif­fer­ent view­points:

L2 (Euclidian) dis­tance: $$d_2(I_1,I_2) = \sqrt{\displaystyle\sum_{p}(I_1^p-I_2^p)^2}$$

Different dis­tance met­rics make dif­fer­ent as­sump­tions about the geom­e­try of the un­der­ly­ing space. L1 pro­duces a square, L2 a cir­cle. L1 changes based on the co­or­di­nate sys­tem. By us­ing a gen­er­alised dis­tance met­ric, we can ap­ply k-near­est-neigh­bours to ba­si­call any kind of data.

Decision bound­aries in L1 tends to fol­low the co­or­di­nate axis, whereas L2 bound­aries are more nat­ural.

Online demo

Once you use this al­go­rithm in prac­tice, you need to set hy­per­pa­ra­me­ters (k, dis­tance mea­sure). These aren’t learned from teh data, but choices you make ahead of time. How do you do this?

When you do cross-val­i­da­tion, you can draw a graph for dif­fer­ent val­ues of hy­per­pa­ra­me­ters (ie. $$k$$).

KNN is ba­si­cally never used on im­ages be­cause:

Linear Classification

An anal­ogy of neural net­works are lego blocks - you stick to­gether all these mod­ules to make a model. Of course a ReLU is just a lin­ear clas­si­fier.

Recall CIFAR10:

Parametric ap­proach

$$\text{Image} \rightarrow f(x,W) \rightarrow 10 \text{ num­bers giv­ing class scores}$$

The im­age is a 32x32x3 vec­tor Larger class scores mean that the class is more likely. In the KNN setup we just keep around the train­ing data. Here, we sum­marise the train­ing data into $$W$$, so we can throw away the train­ing data at test time.

There are all kinds of ways to com­bine data and weights, teh sim­plest is to mul­ti­ply them.

For a 32x32x3 im­age:

  1. We turn the im­age into a $$3072*1$$ vec­tor
  2. The weights are $$10 * 3072$$

$$ f(x,W) = Wx + b$$

The bias term is giv­ing us data-in­de­pen­dent pref­er­ence for one class over an­other. If your dataset is un­bal­anced, the bias term can even things out.

Linear clas­si­fi­ca­tion is a tem­plate-match­ing ap­proach. Which means you can go back­wards and vi­su­alise the rows in the weight ma­trix and see the tem­plates! (This also use­ful for CCA work­shop)

The prob­lem is also ap­par­ent: The lin­ear clas­si­fier only learns one tem­plate per class. If there’s in­ter­class vari­a­tion, it’s go­ing to have prob­lems. Neural net­works don’t have this re­stric­tion - they can learn more than one tem­plate.

In high-di­men­sional space, each im­age is a point. The lin­ear clas­si­fier is draw­ing planes through this space to seper­ate the data.

Linear clas­si­fi­ca­tion strug­gles when the de­ci­sion bound­aries aren’t con­tin­u­ous, ie. if there’s is­lands in the plane.

Next: How can we de­ter­mine if $$W$$ is good or bad?