Image Classification

Today we're diving into how learning algorithms work in detail.

As we know, numpy is good for efficient vector maths. [Tutorial]

Image classification: A core task in computer vision

Your system receives some input image. The model also has a fixed set of discrete labels:

$$\{\text{cat}, \text{dog}, \text{plane}, \text{car}\}$$

The task is to assign a class to the image.

To the computer the image isn't a holistic image of a cat - it's just a huge array of numbers between 0 and 255. The difference between the semantic idea of "cat" and this huge array is known as the semantic gap.

You can change the image in very subtle ways (e.g. move the camera) and every single pixel in the array will be different, even though it's still the same cat. Other challenges include changes in illumination, deformation, occlusion, background clutter, and intra-class variation.

Our brains evolved to do all of this, but it's very hard to do automatically. Amazingly, some of this tech works at human-level performance on some limited problems.

Unlike, say, sorting an array, there's no obvious way to hard-code the algorithm for recognizing the cat.

Attempts have been made though:

  1. Edge detection
  2. Categorize the different corners, edges, etc., and write down a set of rules to recognize a cat

This is brittle and doesn't scale: if you want to recognize a different object class, you have to start over from scratch.

A data-driven approach is much better.

  1. Collect a dataset of images and labels
  2. Use machine learning to train a classifier (Ingest all the data and summarise it in some way)
  3. Evaluate the classifier on new images

Our API has changed: rather than one function that takes an image and returns a label, we have two: train() and predict().
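Roughly, the interface looks something like this (a sketch; the function names are illustrative, not a fixed API):

def train(images, labels):
    # machine learning! Build some model that summarises the training data
    return model

def predict(model, test_images):
    # use the model to predict labels for images it has never seen
    return test_labels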

Of course the concept of a data-driven approach is much more general

First classifier: Nearest Neighbour

train(): memorize all the data and labels

predict(): predict the label based on the most similar training image

Let's do this on CIFAR10

Given a pair of images, how do we actually compare them?

L1 distance: $$d_1(I_1, I_2) = \displaystyle\sum_{p}|I_1^p-I_2^p|$$

This just compares the two images pixel by pixel, subtracts them, and finally adds up the absolute differences to produce a single scalar. It's kind of a stupid way to compare images, but it does reasonable things sometimes.
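As a tiny sanity check, here is the L1 distance between two small made-up arrays in numpy (the values are arbitrary):

import numpy as np

I1 = np.array([[56, 32], [10, 25]])  # hypothetical 2x2 "images"
I2 = np.array([[10, 20], [24, 17]])

d1 = np.sum(np.abs(I1 - I2))  # |56-10| + |32-20| + |10-24| + |25-17| = 80
print(d1)  # 80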

Python code for a nearest neighbour classifier:

import numpy as np

class NearestNeighbour:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is N x D where each row is an example. y is a one-dimensional vector of size N,
        # where N is of course the number of training examples.
        # The nearest neighbour classifier simply memorises all the training data and labels.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # X is N x D where each row is an example we want to predict a label for
        num_test = X.shape[0]
        # Let's make sure that the output type matches the input type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        for i in range(num_test):
            # L1 distance: a vector with the distance to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index of the closest training image
            Ypred[i] = self.ytr[min_index]    # predict the label of the closest image
        return Ypred

Computation: train is O(1), predict is O(N).

This is backwards: we can tolerate classifiers that are slow at training time, but we want them to be fast at test time - nearest neighbour is the wrong way around. Neural networks are the reverse of this: expensive to train, cheap to evaluate.

We can draw the decision boundaries of the nearest neighbour classifier. By looking at the picture we start to see some of the problems: the boundaries are jagged, and a single noisy or outlier point can carve out an island of its own class inside another.

This motivates k-nearest neighbours. Instead of just finding the single nearest neighbour, we find the k nearest neighbours and take a majority vote as to which class a datapoint should belong to. This leads to smoother decision boundaries. You almost always want k > 1.
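A minimal sketch of how predict() above might change for k > 1, assuming the same memorised self.Xtr / self.ytr and integer class labels (needed for the bincount vote):

    def predict(self, X, k=1):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            nearest = np.argsort(distances)[:k]     # indices of the k closest training images
            votes = self.ytr[nearest]               # their labels
            Ypred[i] = np.bincount(votes).argmax()  # majority vote (assumes integer labels)
        return Ypred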

When thinking about computer vision, it's good to flip back and forth between different viewpoints: an image as a concrete grid of pixels, and an image as a single point in a very high-dimensional space.

L2 (Euclidean) distance: $$d_2(I_1,I_2) = \sqrt{\displaystyle\sum_{p}(I_1^p-I_2^p)^2}$$
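Computed in the same vectorised style as the L1 case, this would replace the distance line inside predict:

distances = np.sqrt(np.sum(np.square(self.Xtr - X[i, :]), axis=1))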

Different distance metrics make different assumptions about the geometry of the underlying space. The set of points at a fixed L1 distance forms a square (a diamond), while for L2 it forms a circle. L1 distances change if you rotate the coordinate frame; L2 is invariant to rotation. By using a generalised distance metric, we can apply k-nearest-neighbours to basically any kind of data.

Decision boundaries under L1 tend to follow the coordinate axes, whereas L2 boundaries look more natural.

Online demo

Once you use this algorithm in practice, you need to set hyperparameters (k, the distance measure). These aren't learned from the data, but choices you make ahead of time. How do you do this? The standard approach is to split your data into train, validation, and test sets: pick the hyperparameters that work best on the validation set, then evaluate once on the test set. Cross-validation - splitting the training data into folds and averaging over them - is more robust but more expensive, and is mostly used for small datasets.

When you do cross-validation, you can draw a graph of accuracy for different values of a hyperparameter (e.g. $$k$$) and pick the best one.
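A rough sketch of 5-fold cross-validation for choosing k. The classifier object here stands in for a k-nearest-neighbour implementation like the one above, extended to take k; it and the full training set Xtr, ytr are assumed, not defined here:

import numpy as np

num_folds = 5
X_folds = np.array_split(Xtr, num_folds)  # Xtr, ytr: the full training set
y_folds = np.array_split(ytr, num_folds)

for k in [1, 3, 5, 10, 20, 50, 100]:
    accuracies = []
    for fold in range(num_folds):
        # hold out one fold for validation, train on the rest
        X_val, y_val = X_folds[fold], y_folds[fold]
        X_train = np.concatenate(X_folds[:fold] + X_folds[fold + 1:])
        y_train = np.concatenate(y_folds[:fold] + y_folds[fold + 1:])
        classifier.train(X_train, y_train)
        y_pred = classifier.predict(X_val, k=k)
        accuracies.append(np.mean(y_pred == y_val))
    print(k, np.mean(accuracies))  # average validation accuracy for this k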

KNN is basically never used on images: it's very slow at test time; distances on raw pixels aren't very informative (perceptually different images can have the same L1/L2 distance); and the curse of dimensionality means you would need an enormous number of training examples to cover the image space densely.

Linear Classification

A useful analogy: neural networks are like Lego blocks - you stick together modules to build a model, and the linear classifier is one of the most basic of those building blocks.

Recall CIFAR10: 50,000 training images and 10,000 test images, each 32x32x3, labelled with one of 10 classes.

Parametric approach

$$\text{Image} \rightarrow f(x,W) \rightarrow 10 \text{ numbers giving class scores}$$

The image $$x$$ is stretched out into a 32x32x3 = 3072-dimensional vector. Larger class scores mean that the class is more likely. In the KNN setup we just keep around the training data; here, we summarise the training data into $$W$$, so we can throw away the training data at test time.

There are all kinds of ways to combine the data and the weights; the simplest is to multiply them.

For a 32x32x3 image:

  1. We stretch the image into a $$3072 \times 1$$ vector
  2. The weights are $$10 \times 3072$$, so $$Wx$$ gives 10 class scores

$$ f(x,W) = Wx + b$$

The bias term gives us a data-independent preference for one class over another. If your dataset is unbalanced, the bias term can even things out.
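A minimal numpy sketch of this score function on one CIFAR10-sized image; the weights and biases here are random placeholders rather than trained values:

import numpy as np

x = np.random.rand(3072)       # a flattened 32x32x3 image (placeholder values)
W = np.random.randn(10, 3072)  # one row of weights per class
b = np.random.randn(10)        # one bias per class

scores = W.dot(x) + b          # 10 class scores; the largest is the predicted class
print(scores.argmax())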

Linear classification is a template-matching approach, which means you can go backwards and visualise the rows of the weight matrix as images to see the templates! (This is also useful for the CCA workshop.)
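For example, assuming $$W$$ has shape $$10 \times 3072$$ as above, the template for class c can be reshaped back into image form (ignoring the rescaling you'd need to actually display it):

template = W[c, :].reshape(32, 32, 3)  # row c of W, viewed as a 32x32 RGB "template"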

The problem is also apparent: the linear classifier only learns one template per class. If there's a lot of intra-class variation (e.g. the same class seen in different poses), it's going to have problems. Neural networks don't have this restriction - they can learn more than one template per class.

In high-dimensional space, each image is a point. The linear classifier is drawing hyperplanes through this space to separate the data.

Linear classification struggles when the classes aren't linearly separable, e.g. when one class forms multiple disjoint islands in the plane.

Next: How can we determine if $$W$$ is good or bad?