Interpreting the visual world is one of those things that’s so easy for humans we’re hardly even conscious we’re doing it. When we see something, whether it’s a car, a tree, or our grandma, we don’t (usually) have to consciously study it before we can tell what it is. For a computer, however, identifying a human being at all (as opposed to a dog or a chair or a clock, let alone your grandmother) represents an amazingly difficult problem.
And the stakes for solving that problem are extremely high. Image recognition, and computer vision more broadly, is integral to a number of emerging technologies, from high-profile advances like driverless cars and facial recognition software to more prosaic but no less important developments, like building smart factories that can spot defects and irregularities on the assembly line, or developing software to allow insurance companies to process and categorize photographs of claims automatically.
We’re going to explore the challenge of image recognition and how data scientists are using a special type of neural network to address it.
Learning to see is hard (and expensive)
A good way to think about this problem is as one of applying metadata to unstructured data. In our article on content-based recommendations, we looked at some of the challenges of categorizing and searching content in cases where that metadata is sparse or nonexistent. Hiring human experts to manually tag libraries of movies and music may be a daunting task, but it’s an impossible one when it comes to challenges like teaching the navigation system in a driverless car to distinguish pedestrians crossing the road from other vehicles, or tagging, categorizing, and filtering the millions of user-uploaded pictures and videos that appear daily on social media.
One way to solve this would be through neural networks. While in theory we could use conventional neural networks to analyze images, in practice this turns out to be prohibitively expensive from a computational perspective. For instance, a conventional neural network attempting to process even a relatively small image (let’s say 30×30 pixels) would still require 900 inputs and, with a single fully connected hidden layer, more than half a million parameters. While that might be manageable for a reasonably powerful machine, once the images become larger (say 500×500 pixels), the number of inputs and parameters required increases to truly absurd levels.
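To see how quickly this blows up, here’s a back-of-the-envelope calculation in Python. The 625-neuron hidden layer is purely an illustrative assumption, chosen to show the scale involved:

```python
# Rough parameter count for a single fully connected layer.
# The 625-neuron hidden layer is an illustrative assumption.

def dense_params(width, height, hidden_units):
    inputs = width * height                       # one input per (grayscale) pixel
    return inputs * hidden_units + hidden_units   # weights plus biases

print(dense_params(30, 30, 625))    # 563,125  -- over half a million parameters
print(dense_params(500, 500, 625))  # 156,250,625 -- over 156 million
```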
What’s more, applying conventional neural networks to image recognition can lead to another problem: overfitting. Simply put, overfitting is what happens when a model tailors itself too closely to the data it’s been trained on. A huge number of parameters doesn’t just add computational expense; it also gives the model enough freedom to memorize its training data, which results in a loss of performance when the model is exposed to new data.
The solution? Convolution!
Fortunately, a relatively straightforward change to the way a neural network is structured can make even large images more manageable. The result is what we call convolutional neural networks (also called CNNs or ConvNets).
One of the advantages of neural networks is their general applicability, but as we’ve seen when dealing with images, this advantage turns into a liability. CNNs make a conscious tradeoff: By designing a network specifically to handle images, we sacrifice some generalizability for a much more feasible solution.
Specifically, CNNs take advantage of the fact that, in any given image, proximity is strongly correlated with similarity. That is, two pixels that are near one another in a given image are more likely to be related than two pixels that are further apart. However, in a typical neural network, every pixel gets connected to every single neuron. In this case, the added computational load actually makes our network less rather than more accurate.
Convolution solves this by simply killing a lot of these less important connections. In more technical terms, CNNs make image processing computationally manageable by filtering connections by proximity. Rather than connecting every input to every neuron in a given layer, CNNs intentionally restrict connections so that any one neuron only accepts inputs from a small subsection of the layer before it (like, say, 3×3 or 5×5 pixels). Thus, each neuron is only responsible for processing a certain part of an image. (Incidentally, this is more or less how the individual cortical neurons in your brain work: Each neuron responds to only a small part of your overall visual field.)
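As a rough illustration of what “filtering connections by proximity” means, here’s a minimal NumPy sketch of a single convolution. Each output value is computed only from the small patch of pixels beneath the filter, never from the whole image (in a real network, the filter weights would be learned, and there would be many filters):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel across the image. Each output value is
    computed only from the patch directly beneath the kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(30, 30)        # a toy 30x30 grayscale image
kernel = np.random.rand(3, 3)         # one (normally learned) 3x3 filter
features = convolve2d(image, kernel)  # 28x28 map of local responses
```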
Inside a convolutional neural network
But how does this filtering work? The secret is in the addition of two new types of layers: convolutional and pooling layers. We’ll break the process down below, using the example of a network designed to do just one thing: determine whether a picture contains a grandma or not.
The first step is the convolution layer, which actually consists of several steps in itself:
- First, we’ll break down a picture of grandma into a series of overlapping 3×3-pixel tiles.
- Next, we’ll run each of these tiles through a simple, single-layer neural network, using the same weights for every tile. This will turn our collection of tiles into an array of outputs. Because we kept each tile small (in this case, 3×3 pixels), the neural network required to process them stays small and manageable.
- Then, we’ll take those output values and arrange them in an array that numerically represents the content of each area of our photograph, with the axes representing height, width, and color channels. So in our case, we’d have a 3×3×3 representation for each tile. (If we were talking about videos of grandma, we’d throw in a fourth dimension for time.) A minimal sketch of these three steps follows this list.
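Here’s a rough NumPy sketch of those three steps. The random weights stand in for a trained single-layer network; in a real CNN they’d be learned, and there would be many such filters rather than one:

```python
import numpy as np

def extract_tiles(image, size=3):
    """Step 1: break the image into overlapping size x size tiles (stride 1)."""
    h, w = image.shape[:2]
    return np.array([[image[i:i + size, j:j + size]
                      for j in range(w - size + 1)]
                     for i in range(h - size + 1)])

rng = np.random.default_rng(0)
photo = rng.random((30, 30, 3))    # a toy RGB "photo of grandma"
tiles = extract_tiles(photo)       # shape (28, 28, 3, 3, 3): a grid of 3x3x3 tiles

# Steps 2 and 3: score every tile with the SAME single-layer weights,
# arranging the outputs into an array indexed by tile position.
weights = rng.random((3, 3, 3))
activations = np.einsum('ijxyc,xyc->ij', tiles, weights)
print(activations.shape)           # (28, 28): one output per tile
```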
Then comes the pooling layer, which takes these three- (or four-) dimensional arrays and applies a downsampling function along the spatial dimensions. The result is a pooled array containing only the more important parts of the image while discarding the rest, which both minimizes the computation we’ll need to do and helps avoid overfitting.
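Continuing the sketch above, one common choice of downsampling function is 2×2 max pooling (the block size here is our assumption), which keeps only the strongest activation in each block:

```python
import numpy as np

def max_pool(features, size=2):
    """Keep only the strongest activation in each size x size block."""
    h, w = features.shape
    h, w = h - h % size, w - w % size   # trim to a multiple of the block size
    blocks = features[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))      # max over each block

features = np.random.rand(28, 28)  # e.g., the activations from the previous sketch
pooled = max_pool(features)        # downsampled to 14x14
```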
Lastly, we’ll take our downsampled array and use it as the input for a regular, fully connected neural network. Since we’ve dramatically reduced the size of the input using convolution and pooling, we should now have something a normal network can handle while still preserving the most important parts of the data. The output of this final step will represent how confident the system is that we have a picture of a grandma.
Note that this is a simplified explanation of how a convolutional neural network works. In real life, the process is (excuse the pun) more convoluted, involving multiple convolutional, pooling, and hidden layers. Additionally, real CNNs typically involve hundreds or thousands of labels, rather than just one.
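To make the full pipeline concrete, here’s what a toy version might look like using TensorFlow’s Keras API. Every layer size and the input resolution are illustrative assumptions, not a tuned architecture:

```python
import tensorflow as tf

# A toy binary "grandma or not" classifier: stacked convolution and
# pooling layers followed by a regular fully connected network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),              # small RGB images (assumed size)
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"), # convolution layer
    tf.keras.layers.MaxPooling2D((2, 2)),                  # pooling layer
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),          # fully connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),        # confidence it's grandma
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```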
Implementing convolutional neural networks
Building a convolutional neural network from scratch can be a time-consuming and expensive undertaking. That said, a number of APIs have recently been developed that aim to allow organizations to glean insights from images without requiring in-house computer vision or machine learning expertise.
- Google Cloud Vision is Google’s visual recognition API, based on the open-source TensorFlow framework and using a REST API. It detects individual objects and faces and contains a pretty comprehensive set of labels. It also comes with a few bells and whistles, including OCR and integration with Google Image Search to find related entities and similar images from the web.
- IBM Watson Visual Recognition, part of the Watson Developer Cloud, comes with a large set of built-in classes, but is really built for training custom classes based on images you supply. Like Google Cloud Vision, it also supports a number of nifty features, including OCR and NSFW detection.
- Clarifai is an upstart image recognition service that also uses a REST API. One interesting aspect is that it comes with a number of modules that help tailor its algorithm to particular subjects, like weddings, travel, and food.
While the above APIs may be suitable for some general applications, for specific tasks you might still be better off building a custom solution. Luckily, there are a number of libraries available that make the lives of data scientists and developers a little easier by handling the computational and optimization aspects, allowing them to focus on training models. Many of these libraries, including TensorFlow, DeepLearning4J, Torch, and Theano, have been used successfully in a wide variety of applications.