Image Classification

Today we're diving into how learning algorithms work in detail.

As we know, numpy is good for efficient vector maths. [Tutorial]

Image classification: A core task in computer vision

Your system receives some input image. The model also has a fixed set of discrete labels:

$${cat, dog, plane, car}$$

The task is to assign a class to the image.

To the computer the image isn't a hollistic iimage of a cat - it's just a huge array of numbers between 0 and 255. The difference between the semantic idea of "cat" and this huge array is known as the semantic gap.

You can change the image in very subtle ways (ie. move the camera), then every singl pixel in the array would be different, even thugh it;s stil the same cat. Other challenges include:

Illumination
Object deformation
Occlusion (we might only see parts of the cat)
Background clutter (the object looks the same as the background)
Intraclass variation (cats come in different shapes and sizes)

Our brain is evolved to do all of this, but it's very hard to do automatically. Amazingly, some of this tech works at human-level performance in som limited problems.

Unlike, say, sorting an array, there's no obvious way to hard-code the algorithm for recognizing the cat.

Attempts have been made though:

Edge detection
Categorize the different corners etc, write down a set of rules to recognize a cat

This is:

Brittle
not scalable

A data-driven approach is much better.

Collect a dataset of images and labels
Use machine learning to train a classifier (Ingest all the data and summarise it in some way)
Evaluate the classifier on new images

Our API has changed: rather than one function that takes an image and returns a label, we have two:

train(), which takes images and trains a model
predict(), which takes a model and makes a prediction

Of course the concept of a data-driven approach is much more general

First classifier: Nearest Neighbour

train(): memorize all data and labels predict(): Predict the label based on the most similar source image

Let's do this on CIFAR10

10 classes
50,000 training images
10,000 testing images

Given a pair of images, how do we actually compare them?

L1 distance: $$d_1(I_1, I_2) = \displaystyle\sum_{p}|I_1^p-I_2^p|$$

This just compares all pixels in each image and substracts them, finally adds up the differences to produce a scalar. It's kind of a stupid way to compare images, but it does reasonable things sometimes.

[python code to do nearest neighbour classifier]

import numpy as np

class NearestNeighbour:
    def __init__(self):
        pass

    def train(self, X, y):
    # X is N x D where each row is an example. Y is a one-dimensional vector of size N. N of course the number of examples
    # Below, we're just memorising all the training data and all the examples
    self.Xtr = X
    self.ytr = y

    def predict(self, X):
    # X is N x D where each row is an example we want to predict a label for
    num_test = X.shape[0]
    # Let's make sure that teh output type matches th input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    for i in xrange(num_test):
        #L1 distance:
        distances = np.sum(np.abs(self.Xtr - X[i,:]), axis=1) # this returns a vector with the distance to each training image
        min_index = np.argmin(distances)
        Ypred[i] = self.ytr[min_index]
    return Ypred

Computation: train O(1) predict O(N)

This is bad: We want classifiers that are slow at training time, but fast at testing time - this is the wrong way around. Neural networks are the reverse of this.

We can draw the decision boundaries of the nearest neighbour classifier. By looking at the picturewe start to see some of the problems:

a little yellow island in the middle of the green region
noisy blue region

This motivates k-nearest neghbours. Instead of just finding the nearest neighbour, we find k nearest neighbours, and take a majority vote as to whihch class a datapoint should belong to. This leads to smoother decision boundaries. You almost always want K > 1.

When thinking about computer vision, it's good to flip back and forth between different viewpoints:

high-dimensional points in the plane
actual images

L2 (Euclidian) distance: $$d_2(I_1,I_2) = \sqrt{\displaystyle\sum_{p}(I_1^p-I_2^p)^2}$$

Different distance metrics make different assumptions about the geometry of the underlying space. L1 produces a square, L2 a circle. L1 changes based on the coordinate system. By using a generalised distance metric, we can apply k-nearest-neighbours to basicall any kind of data.

Decision boundaries in L1 tends to follow the coordinate axis, whereas L2 boundaries are more natural.

Online demo

Once you use this algorithm in practice, you need to set hyperparameters (k, distance measure). These aren't learned from teh data, but choices you make ahead of time. How do you do this?

Idea 1: Choose hyperparameters that work best on training data. Bad. K=1 always works perfectly on training data. Ie. leads to overfitting.
Idea 2: Split data into train and test, choose hyperparameters that work best on test data. Also bad. We'd have no understanding of how algorithm will perform on new data.
Idea 3: Split data into train, validation and test sets, choose hyperparameters on validation and evaluate on test. You only ever run the algorithm once on the test data.
Idea 4: Cross validation. We'll take our data (save test set), and split the data into lots of different folds. Try each fold as validation and average the results. This is kind of the gold standard, but hard to use in deep learning.

When you do cross-validation, you can draw a graph for different values of hyperparameters (ie. $$k$$).

KNN is basically never used on images because:

Slow at test time
And L2 isn't very good at findin perceptual difference at images
Curse of dimensionality. For knn to work well, we need the space densely covered (otherwise the nearest neighbours could be far away). To do this, we need an exponential number of training instances to cover the space well enough.

Linear Classification

An analogy of neural networks are lego blocks - you stick together all these modules to make a model. Of course a ReLU is just a linear classifier.

Recall CIFAR10:

50,000 train images
10,000 test images
Each image 32x32x3
10 classes

Parametric approach

$$\text{Image} \rightarrow f(x,W) \rightarrow 10 \text{ numbers giving class scores}$$

The image is a 32x32x3 vector Larger class scores mean that the class is more likely. In the KNN setup we just keep around the training data. Here, we summarise the training data into $$W$$, so we can throw away the training data at test time.

There are all kinds of ways to combine data and weights, teh simplest is to multiply them.

For a 32x32x3 image:

We turn the image into a $$3072*1$$ vector
The weights are $$10 * 3072$$

Bias term is 10x1
we end up with a $$10 * 1$$ vector giving us probabillities for each class.

$$ f(x,W) = Wx + b$$

The bias term is giving us data-independent preference for one class over another. If your dataset is unbalanced, the bias term can even things out.

Linear classification is a template-matching approach. Which means you can go backwards and visualise the rows in the weight matrix and see the templates! (This also useful for CCA workshop)

The problem is also apparent: The linear classifier only learns one template per class. If there's interclass variation, it's going to have problems. Neural networks don't have this restriction - they can learn more than one template.

In high-dimensional space, each image is a point. The linear classifier is drawing planes through this space to seperate the data.

Linear classification struggles when the decision boundaries aren't continuous, ie. if there's islands in the plane.

Next: How can we determine if $$W$$ is good or bad?

Loss function
Optimisation
Convolutional Nets