Introduction to Convolutional Neural Networks for Visual Recognition

Why do we study visual data?

The amount of visual data has exploded, driven by the growing number of sensors in the world. Roughly 80% of all internet traffic is video (Cisco, 2016). Visual data is hard to understand automatically, i.e. the dark matter of the internet.

Computer vision touches on all kinds of fields:

CS224 on Deep Learning and Natural Language Processing

A history of computer vision

The number of animal species went from a few hundred to hundreds of thousands within about 10 million years. Parker (2010) suggests this was because of the development of vision. Vision is now the most important sensory system in most intelligent animals.

None of these really managed to solve object recognition. Maybe we should do object segmentation first?

Going into the 2000s, we start getting much better data (thanks to the internet). In the early 2000s we start to have benchmark datasets that allow us to measure progress in object recognition. The PASCAL Visual Object Challenge is one of these: 20 categories with roughly 10,000 images each, so different research groups can compare progress on it. ImageNet (Deng, Dong, Socher, Li, Li, Fei-Fei) is the next iteration of this, with 14M images in 22,000 categories. In 2009 the ImageNet team writes up the Large Scale Visual Recognition Challenge: 1,000 object classes, 1.4M images (Russakovsky et al. 2014). Within a few years, image recognition algorithms are on par with human performance (as measured by a single Stanford PhD student doing the challenge for weeks).

In 2012 the error rate drops by about 10 percentage points. The winning entry is a convolutional neural network, which is what this course is about.

An overview of the course

We're focused on image classification. This relatively simple tool is useful by itself, but we'll also talk about object detection (drawing bounding boxes around objects in images) and image captioning (given an image, produce a natural language sentence describing that image).
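To make the classification task concrete, here is a minimal sketch of a classifier as a function from image to label: a nearest-neighbor classifier on tiny synthetic data. The dataset, sizes, and function name are all illustrative assumptions, not part of the course material; the course covers far better classifiers, but this shows the shape of the problem.

```python
import numpy as np

# Hypothetical tiny dataset: six 4x4 grayscale "images" in two classes.
rng = np.random.default_rng(0)
train_images = rng.random((6, 4, 4))
train_labels = np.array([0, 0, 0, 1, 1, 1])

def classify_nearest_neighbor(image, images, labels):
    """Predict a label by finding the training image with the
    smallest L1 (sum of absolute pixel differences) distance."""
    distances = np.abs(images - image).sum(axis=(1, 2))
    return labels[np.argmin(distances)]

# A training image is its own nearest neighbor, so it gets its own label.
pred = classify_nearest_neighbor(train_images[3], train_images, train_labels)
```

The point is only the interface: classification maps a grid of pixels to one of a fixed set of category labels, and progress in the field is measured by how often that mapping is right.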

Convolutional NNs had this breakthrough in 2012; since then we're basically fine-tuning and scaling up, going from 8 to 200 layers. The general idea has been around since the 90s, but thanks to much faster computation (GPUs) and much larger labeled datasets, they're much more advanced now.
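The core operation that gives these networks their name is convolution: sliding a small learned filter over the image. A minimal sketch (the function name, filter values, and toy image are my own illustrative assumptions):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small filter over the image, taking an elementwise
    product-and-sum at every position ('valid' mode, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (image[y:y+kh, x:x+kw] * kernel).sum()
    return out

# A vertical-edge filter responds where values change from left to right.
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
edge_filter = np.array([[-1., 1.],
                        [-1., 1.]])
response = conv2d_valid(image, edge_filter)
```

A convolutional network stacks many such filters, with the filter weights learned from data rather than hand-designed; depth (8 to 200 layers) means repeating this filtering many times.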

Human vision does much more than draw bounding boxes: it builds 3D models of the world and recognizes activities (given a video, working out what is happening).

Johnson et al. (2015): Image Retrieval using Scene Graphs

In some sense the holy grail of computer vision is to understand the story of an image in a rich and nuanced way.

[Barack Obama pressing his foot on a scale]

We understand this as funny because we bring incredible understanding and background knowledge to this image: how scales work, who Obama is, how people feel about their weight, etc.

Goodfellow, Bengio and Courville: Deep Learning