
Convolutional Neural Networks


So far, we’ve learned how to implement MNIST using a straightforward, densely connected neural network. It’s simple to understand and implement; however, its predictive capacity has limits: as the previous example showed, it does not always, or even often, predict the correct number. There are many other types of neural networks you can create to solve this problem. One common architecture for image classification problems, of which MNIST is an example, is the Convolutional Neural Network (CNN).

A CNN model contains several layers of varying styles. Some are convolution layers, some are pooling layers, and some are the densely connected layers we’ve seen so far. Finally, there is the output predictions layer, which is the same as in the densely connected neural network we created previously.

We’ll explain what each of these layer types is and what function it serves. I think a CNN model goes a long way to explaining the power of neural networks and why they are so good at solving several seemingly incredibly hard problems.

Visualizing a CNN

A great application that helps you visualize how a CNN works is An Interactive Node-Link Visualization of Convolutional Neural Networks[1]. I recommend you spend a little time now navigating that site. Draw a number and then hover over the individual pixels in the different layers of the 3D representation to the side to see how the data is aggregated up to the top layer, which is the ten-element output tensor we learned about in the previous lectures.

Convolution

The first type of layer in a CNN is a convolution layer, which contains several convoluted features.

A convoluted feature is a matrix which we multiply across the original image (which is, again, just a matrix) to generate a new matrix. This is an excellent visualization of a feature in action:

conv feat 1
Figure 1. Start
conv feat 2
Figure 2. Somewhere in the middle
conv feat 3
Figure 3. Finish

I’m willing to bet you’ve used a feature like this many times, probably today: it’s how we manipulate images in photo apps. For instance, if we wanted to sharpen an image, we might use a feature like so:

sharpen matrix

Multiplying an image with the above matrix results in a new, sharper image, like so:

5.mnist.004
Figure 4. Sharpening an image

So each convoluted feature results in a new image with some data highlighted. The above feature sharpens an image; other features may find edges, curves, or a vast myriad of other things.
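To make the sliding-and-multiplying concrete, here is a minimal plain-JavaScript sketch of the idea (not how TensorFlow.js implements it internally), applying the sharpen feature above to a tiny made-up grayscale image:

```javascript
// Minimal sketch: slide a k x k feature (kernel) across a grayscale image,
// multiply each overlapping pair of values, and sum them to produce each
// output pixel.
function convolve(image, kernel) {
  const k = kernel.length;
  const out = [];
  for (let y = 0; y <= image.length - k; y++) {
    const row = [];
    for (let x = 0; x <= image[0].length - k; x++) {
      let sum = 0;
      for (let ky = 0; ky < k; ky++) {
        for (let kx = 0; kx < k; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// The sharpen feature from the matrix above.
const sharpen = [
  [ 0, -1,  0],
  [-1,  5, -1],
  [ 0, -1,  0],
];

// A tiny made-up 4x4 "image"; a 3x3 kernel over it yields a 2x2 output.
const image = [
  [1, 1, 1, 1],
  [1, 2, 2, 1],
  [1, 2, 2, 1],
  [1, 1, 1, 1],
];
console.log(convolve(image, sharpen)); // [[4, 4], [4, 4]]
```

Each output pixel is the sum of the nine products of the kernel with the 3x3 patch of the image beneath it, which is exactly the operation animated in the figures above.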

So the first convolution layer in a CNN has a set of such features; each feature creates a new image with some data highlighted.

Some features might be more useful than others for solving certain types of problems. For example, to solve our MNIST problem, we may need some features that highlight straight lines and others that highlight curves, so we can distinguish the different parts of each digit.

What features should you pick?

Given the type of problem you are trying to solve with this neural network, which features should you pick? This is where things get very interesting: you don’t choose the features. The neural network evolves the features during the training process; it learns to highlight certain parts of the images.

Each feature is just a set of weights, like so:

weights matrix

Like other weights in a neural network, a feature is initialized to a set of random numbers, perhaps like so:

generated values matrix

Over time, as we train the model, it tweaks the weights in the feature, learning to highlight some parts of the image and hide others.
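As a sketch of that starting point, here is one way a 3x3 feature could be initialized to random numbers in plain JavaScript (real initializers, such as the glorot family, choose the range more carefully than this):

```javascript
// Sketch: a 3x3 feature is just nine weights, here initialized to
// random values in [-1, 1). Training then nudges these numbers until
// the feature highlights something useful.
function randomFeature(size) {
  return Array.from({ length: size }, () =>
    Array.from({ length: size }, () => Math.random() * 2 - 1)
  );
}

const feature = randomFeature(3);
console.log(feature); // e.g. [[0.42, -0.17, 0.88], ...]
```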

We represent this in TensorFlow.js with a layer declaration like so:

tf.layers.conv2d({
    inputShape: [28, 28, 1], (1)
    kernelSize: 3, (2)
    filters: 16, (3)
    activation: "relu" (4)
})
1 This is the shape of the image, the input data we want to run the convolution over.
2 This is the size of the filter, so 3 means a 3x3 matrix.
3 This is the number of filters we want to create; this layer will output 16 copies of the input image, each with a different filter applied.
4 We also add an activation function, in this case relu, to the layer; it will scale the output values according to the relu function.

Important

The above layer definition hides much complexity from us; underneath, it’s creating 16 x 3 x 3 = 144 different weights! Each of these weights will need to be tuned by our neural network model.
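We can sanity-check that figure with a couple of lines of arithmetic; note the extra bias weights are an assumption based on conv2d layers typically adding one bias per filter by default:

```javascript
// Weight count for the conv2d layer above: each of the 16 filters is a
// 3x3 kernel over the single input channel of a grayscale MNIST image,
// plus (typically, by default) one bias weight per filter.
const filters = 16, kernelSize = 3, inputChannels = 1;
const kernelWeights = filters * kernelSize * kernelSize * inputChannels; // 144
const biases = filters;                                                  // 16
console.log(kernelWeights, kernelWeights + biases); // 144 160
```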

Pooling

The above results in an explosion of images: with 16 features, each input image results in 16 copies, which also means you need 16 times the memory on your computer to hold all that data. That’s one of the challenges of CNNs: they require a lot more memory, and proportionally a lot more training, since there is an increased number of weights to tune.

One solution to this is pooling, and the concept is quite simple, we resize an image to a smaller image, like so:

5.mnist.005
Figure 5. Pooling applied to an image makes a smaller image

We scan over the input matrix and summarize the information into a single number that we store in the output matrix, like so:

max pool
Figure 6. Pooling represented as a matrix operation

Note

Does a smaller image still have enough information for us to extract meaning? It’s not a question of size or resolution; it’s a question of information: how can we summarize the information so it takes up less memory but is still useful to the downstream parts of the neural network?

The stride determines how many cells the window skips per iteration through the source matrix. There are several functions we can apply to the source numbers; max is a good one: whatever the maximum number in the input window is, use that for the output matrix. But you can use others, such as average.
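The window-and-stride mechanics can be sketched in a few lines of plain JavaScript (again, a conceptual sketch rather than TensorFlow.js internals):

```javascript
// Sketch: 2x2 max pooling with stride 2 keeps only the largest value
// in each 2x2 window of the input matrix, halving each dimension.
function maxPool(matrix, poolSize, stride) {
  const out = [];
  for (let y = 0; y + poolSize <= matrix.length; y += stride) {
    const row = [];
    for (let x = 0; x + poolSize <= matrix[0].length; x += stride) {
      let max = -Infinity;
      for (let py = 0; py < poolSize; py++) {
        for (let px = 0; px < poolSize; px++) {
          max = Math.max(max, matrix[y + py][x + px]);
        }
      }
      row.push(max);
    }
    out.push(row);
  }
  return out;
}

const input = [
  [1, 3, 2, 1],
  [4, 6, 5, 0],
  [1, 2, 9, 7],
  [0, 3, 4, 8],
];
console.log(maxPool(input, 2, 2)); // [[6, 5], [3, 9]]
```

Swapping `Math.max` for an average of the window would give you average pooling instead.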

The TensorFlow.js API has a list[2] of pooling layers you can create. To create a max pooling layer like the one described above, we use a layer declaration like so:

tf.layers.maxPooling2D({ (1)
    poolSize: 2, (2)
    strides: 2 (3)
})
1 This is a pooling layer that uses the max function to summarize the information.
2 The input window; in this case, it will be a 2x2 matrix.
3 How the input window iterates across the input matrix; it will move two cells across each iteration.

The above will result in an output image which is half the width and height of the input, reducing the image size and memory requirements to a quarter.
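We can follow the shape bookkeeping through both layers with a little arithmetic; this assumes the conv layer uses the common default of "valid" padding, which trims the border where the kernel would overhang the image:

```javascript
// Shape bookkeeping: with "valid" padding a 3x3 kernel trims a 28x28
// image to 26x26, then 2x2 max pooling with stride 2 halves each side.
const convOut = (size, kernel) => size - kernel + 1;
const poolOut = (size, pool, stride) => Math.floor((size - pool) / stride) + 1;

const afterConv = convOut(28, 3);           // 26
const afterPool = poolOut(afterConv, 2, 2); // 13
console.log(afterConv, afterPool); // 26 13
```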

A more detailed and thorough explanation of pooling can be found in the article What is max pooling in convolutional neural networks?[3]

Summary

A human brain determines what’s inside a picture by looking for features. We have a complex set of filters in our mind which processes images as they come in, throws away most of the information, and passes summaries to our mind, which decides what we are looking at. CNNs are doing the same thing: the convoluted features extract information from the input sources to highlight the bits we need to understand. This is one reason CNNs are used so much in ML for image processing and sound processing; they mirror the way our brains work.

It’s a sophisticated methodology, and we’ve barely scratched the surface, but I think you can see the power of TensorFlow.js: with a few lines of code you can create a very powerful model.


1. An Interactive Node-Link Visualization of Convolutional Neural Networks https://www.cs.ryerson.ca/~aharley/vis/conv/
3. What is max pooling in convolutional neural networks? https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks

