Introduction to Convolutional Neural Networks for Visual Recognition

Why do we study visual data?

The amount of data has exploded - this is due to the increase in sensors in the world. Roughly 80% of all traffic on the internet are video (Cisco 2016). Visual data is hard to understand - ie. the dark matter of the internet.

Computer vision touches on all kinds of fields:

Physics (we need to understand optics)
Biology (how do animal brains see images)
Mathematics

CS224 on Deep Learning and Natural Language Processing

A history of computer vision

The number of species went from a few hundred to hundreds of thousands within 10 million years. Parker (2010) suggests this is because the development of vision. Now vision is the most important sensory systems in most intelligent animals.

Camera Obscura (Renaisscance) not dissimilar to early animal eyes.
Hubel, Wiesel (1959) as what the visual processing mechanism is like in animals. They show that visual information is processed on different levels - from simple features to more complex ones.
Larry Roberts (1963): Block World.
1966: The summer vision project at MIT
David Marr: Vision suggests a process that goes from "Primal Sketch" (curves, lines bars, ends, boundaries) to 2.5D sketch (local surface orientation, discontinuities in depth) to 3d model representation
Brooks, Binform (1979) and Fischler, Elschlager (1973) start to ask how we can represent objects in the real world. The shared idea is that complex objects can be pieced together from primitive shapes in geometric configuration.
David Lowe (1987) on edge detection

None of these really managed to solve object recognition. Maybe we shoudl do object segmentation first?

Shi, Malik (1997) use graph theory to segment images
Viola, Jones (2001) are the first to do real-time face detection. The early 2000s is when machine learning techniques start to gain traction. In 2006, five years after this paper, Fujifilm rolls out the first camera with built-in face recognition.
Lowe (1999) on Feature-based object recognition. While objects may vary between photographs (lighting, perspective etc.) there are features that remain the same (diagnostic). Matching these features is much easier than matching the whole object.
Human body recognition, modelling: Dalal, Triggs (2005) and Felzenswalb, McAllester, Ramanan (2009)

Going into the 2000s, we start getting much better data (thanks to the internet). In the early 2000s we start to have benchmark datasets that allow us to measure progress in objet recognition. PASCAL Visual Object Challenge is one of these. It has 20 categories with 10,000 images each. Different research groups can compare progress on this. ImageNet (Deng, Dong, Socher, Li, Li, Fei-Fei)is the next iteration of this with 14M images in 22,000 categories. In 2009 the ImageNet team writes up the Large Scale Visual Recognition Challenge : 1000 Object classes, 1.4M images. (Russakovsky et al. 2014). By 2012 image recognition algorithms are on par with human performance (a single Stanford PHD student doing the challenge for weeks).

In 2012 the error rate drops by 10% - this is a convolutional neural network (this is what this course is about).

An overview of the course

We're focussed on image classification. This relatively simple tool is useful by itself, but we're also talking about object detection (drawing bounding boxes around objects in images) and image captioning (given an image, produce a natural language sentence describing that image).

Convolutional NNs had this breakthrough in 2012, since then we're basically fine-tuning and going from 8 to 200 layers. The general idea has been around since the 90s, but thanks to:

Moore's law
GPUs
Loads of data

they're much more advanced now.

Human vision does much more than draw bounding boxes - there's forming 3d models of the world, activity recognition (given a video, how do you know what's happening).

Johnson et al. (2015): Image Retrieval using Scene Graphs

In some sense the holy grail of computer vision is to understand the story of an image in a rich and nuanced way.

[Barrack Obama pressing his foot on a scale]

We understand this as funny because we have all this incredible understandign and background information about this image - how scales work, who Obama is, how people feel about weight etc.

Goodfellow, Bengio and Courville: Deep Learning