Understanding Convolutional Neural Networks: Making a Handwritten Digit Calculator | Keras #5

In this video, I will be using a convolutional
neural network, implemented with Keras and written in Python, to recognize handwritten
digits and perform basic operations between them. I simultaneously aim to give you an
intuition for, and understanding of, convolutional neural networks and their awesome potential.
I will start off with a demonstration of the program. I enter “13” one digit at a time,
clicking “save image” after each digit. Then I click “multiply”, and enter “14”,
one digit at a time, clicking “save image” after each digit. When I click “equals”,
I get the correct solution of 182. Pretty cool, huh? The core of this program is a convolutional
neural network. Convolutional neural networks are loosely based on the manner in which all
mammals perceive the world around them. This manner involves a hierarchical series of feature
recognitions, starting off with simpler features like diagonal lines, curved edges, etc., and
progressing towards complex or abstract recognitions, like combinations of shapes, and finally,
the classification of entire objects. That’s pretty simple, but how would you mathematically
model this process? Let’s dive deeper into how convolutional neural networks do what
they do. Like the standard neural networks discussed
in previous videos, convolutional neural networks involve input layer neurons, weights, biases,
hidden layer neurons and output layer neurons. However, they are also a bit different. Let’s
begin at the start of the convolutional neural network and work our way through it. Inputs
are a 2-d matrix for black and white images, and a 3-d tensor for colored images. BTW,
for our purposes, tensors are just arrays with more than 2 dimensions (strictly speaking, scalars, vectors, and matrices are tensors too), and the 3rd dimension in
a colored image is for the different color channels, usually red, green, and blue in
an RGB image. In our case, images are black and white. Next, weights are arranged into
2d matrices. This differs from a regular neural net, where inputs and weights are scalars,
or just single numbers. In a regular neural net, you simply multiply inputs by weights.
How does this work with matrices or even tensors? Here, the dot product is computed instead. Essentially,
to perform the dot product between the image, which is a matrix, and a smaller matrix, which
is the weight matrix and is often called a filter, you multiply each value in the filter by the corresponding
value in a section of the image, sum these values, compute the average by dividing by
the number of values added, and place this in the corresponding position in the resulting
matrix. (Strictly, Keras skips the division and just uses the sum; dividing only rescales every value by the same constant.) You then repeat this process, sliding the section of the image that is multiplied,
or the receptive field, until all the values for the resulting matrix are found. Note that
with a 3 by 3 filter, this resulting matrix loses one pixel at its edges in every direction, and programmers
sometimes perform padding, in which a border of zeros is added around the image before convolution to make up for the lost pixels. When performed,
padding could make the resulting matrices 16*16 instead of 14*14. Also note that initially,
like any other weight, filters contain random values and are adjusted in training. Performing
these repeated dot-products in this manner is known as convolution. Convolution plays
a central role in a convolutional neural network, hence, the name convolutional neural networks.
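To make the sliding-window idea concrete, here is a minimal NumPy sketch of the process described above. The function name `convolve2d` and the `average` flag are made up for illustration. With `average=True` it divides by the number of values, as described in this video; with `average=False` it returns the plain sum. It also shows how zero-padding a 16 by 16 image before convolution keeps the result at 16 by 16 instead of 14 by 14.

```python
import numpy as np

def convolve2d(image, kernel, average=True):
    """Slide `kernel` over `image` with a stride of 1 and no padding."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    result = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # The receptive field is the section of the image under the filter.
            field = image[row:row + kh, col:col + kw]
            total = np.sum(field * kernel)
            result[row, col] = total / kernel.size if average else total
    return result

image = np.arange(16 * 16, dtype=float).reshape(16, 16)
kernel = np.ones((3, 3))

unpadded = convolve2d(image, kernel)
# np.pad adds a one-pixel border of zeros around the image before convolution.
padded = convolve2d(np.pad(image, 1), kernel)

print(unpadded.shape)  # (14, 14)
print(padded.shape)    # (16, 16)
```

The explicit loops are slow and only meant to mirror the explanation; a real library does the same arithmetic far more efficiently.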
To understand why, let’s take a look at two examples, one simple and one slightly
more complex. I’ll start with the simpler example. Say that this is the input image,
and we apply this filter to it. This filter can be thought of as a low-resolution horizontal
line, and look what happens in the resulting image: horizontal lines are kept, and almost
everything else fades into a dark gray and is ignored. In this way, filters keep what
is relevant and ignore everything else. ‘What is relevant’ is adjusted based on the needs
of the neural network in training. Mathematically, this occurred because the resulting value
in the matrix was large if and only if the receptive field was very similar to the filter.
If they were not similar, then values in the image matrix were multiplied by values close
to or equal to zero in the filter matrix or vice versa, making the resulting value small.
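Here is a small sketch of that first example, with made-up numbers rather than the actual image from the video. The 7 by 7 `image`, the helper `apply_filter`, and the 3 by 3 `horizontal_filter` are all assumptions for illustration. The image holds one horizontal white line and one vertical white line, and the filter is a low-resolution horizontal line.

```python
import numpy as np

def apply_filter(image, kernel):
    """Unpadded, stride-1 convolution with averaging, as described above."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel) / kernel.size
    return out

image = np.zeros((7, 7))
image[1, 0:5] = 1.0  # a horizontal white line
image[3:7, 5] = 1.0  # a vertical white line

# Ones in the middle row, zeros above and below: a tiny horizontal line.
horizontal_filter = np.array([[0, 0, 0],
                              [1, 1, 1],
                              [0, 0, 0]], dtype=float)

response = apply_filter(image, horizontal_filter)
print(np.round(response, 2))
```

Where the receptive field sits on the horizontal line, all three ones in the filter line up with white pixels and the value is 3/9. On the vertical line, only one of them does, so the value is just 1/9 and that line fades.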
So far in this video, filters have contained only positive values. This, however, is a
simplification. Like in standard neural networks, weights in a convolutional neural network
can contain negative values. For these two examples, when visualizing any values in filters,
including negative values, I set the lowest value in the matrix to black and the highest
value in the matrix to white. While visualization is a powerful tool, it is also important to
consider the numbers, because the numbers are all that the computer sees. In this next
example, I will use negative values in the filter and a more complex image. This is the
new filter. While it looks exactly the same as the filter that was just used, it contains
-1s where there were once zeros. The image itself is much more complex, with many details.
In the resulting convolved image, the horizontal edges are highlighted and most other information
is lost. The general principle of how certain values are kept or even amplified and others
are ignored through convolution remains the same–that is, if the filter and the receptive
field are similar, then the resulting value is large, and if they are not, then the resulting
value is small. However, negative values differ from zeros in an important way. With the filter
that was used in the first example, the values in the receptive field corresponding to the
zeros in the filter don’t matter–they can be zero, or any other value, and since any
value multiplied by zero is zero, the resulting value will be zero. So, with this filter,
you will keep any horizontal line, regardless of what values are above or below it. When
the zeros are replaced with -1s, however, the filter will preserve a horizontal white
line if it has low, or ideally, negative values above and below it. This is because if the
values above and below the line are large, then large values will be made negative, making
the resulting sum much smaller. Similar logic can be used to explain why a horizontal black
line surrounded by large values will be preserved. Hence, this filter may look for horizontal
lines making up an edge, if the edge has a height of one pixel and is surrounded by values
that contrast the line. This fact explains the emphasis on edges rather than any horizontal
lines in the resulting convolved image. In this way, negative values in filters allow
the model to emphasize a wider variety of features. In a convolutional neural network, it is almost
always advantageous to have multiple filters, in order to highlight multiple different features
separately, but more on how this plays out later. The next step involves making the output,
after applying the filter, have a lower resolution, but still retain the most important features.
The point of this is to make the neural network more efficient and less computationally expensive,
especially during training. This step is pretty simple and is called pooling. First, you define
a pool size. This will be the dimensions of a section of the matrix that you will compress
into one value. Perhaps the most popular method of this ‘compression’ is simply
taking the maximum value, where the value represents the pixel intensity. In a black
and white image, which is what we will deal with, the value represents how white a pixel
is. So, the network simply slides across the image, taking the maximum value in a certain
area. Note that for both convolution and pooling, there is something known as a stride length.
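The max-pooling step can be sketched in a few lines of NumPy. The function name `max_pool2d` and the sample `feature_map` are made up for illustration. The `stride` argument controls how far the window moves each step, and here it defaults to the pool size.

```python
import numpy as np

def max_pool2d(matrix, pool_size=2, stride=None):
    """Keep only the maximum value in each pool_size x pool_size window."""
    matrix = np.asarray(matrix, dtype=float)
    if stride is None:
        stride = pool_size  # the usual choice for pooling
    out_h = (matrix.shape[0] - pool_size) // stride + 1
    out_w = (matrix.shape[1] - pool_size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            window = matrix[top:top + pool_size, left:left + pool_size]
            pooled[r, c] = window.max()
    return pooled

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 5, 1],
                        [7, 2, 9, 8],
                        [0, 1, 3, 4]], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 5.]
#  [7. 9.]]
```

Calling `max_pool2d(feature_map, stride=1)` instead gives a 3 by 3 result, which shows how changing the stride changes the dimensions of the output.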
Stride length represents how much you move the window after performing convolution or
pooling, in both the horizontal and vertical directions. For convolution, the stride length
is usually 1 by 1 or 1 horizontally and 1 vertically, and in pooling, the stride length
is usually the same as the pool size, which is the area from which the model takes the
maximum value. Changing the stride length from the standard values is usually unnecessary
and importantly, will change the dimensions of the resulting matrix. Finally, we apply
an activation function to all the values in the now compressed matrix. Remember, an activation
function serves the purpose of allowing a neural network to approximate a non-linear
function and this remains the case in convolutional neural networks. By the way, you could technically
apply an activation function before pooling and directly after convolution, however this
would be slightly more computationally expensive, as the matrices will be larger. These 3 parts,
or convolution, pooling, and activation can be repeated to further add complexity to the
model. Note that you don’t necessarily have to implement these parts precisely in
this order. As an example, it may prove beneficial to perform convolution, activation, convolution,
and only then pooling. What happens when you add more convolutional layers is really cool.
Say that I start with a 28 by 28 black and white image and I have 32 filters in my first
convolutional layer. So, I now have 32 26 by 26 matrices, assuming that I don’t perform
padding. Whether I apply pooling and activation is irrelevant to this scenario. Let’s say
that I don’t, and I add another convolutional layer with 32 filters. I actually don’t
get 32 times 32 or 1024 new matrices after I apply this convolutional layer. So, I don’t
apply 32 filters to each of the previously filtered images. Let’s break down what actually
happens. Each of the 32 filters will actually have
a depth of 32, corresponding to the depth of 32 in the previous layer’s outputted
tensor. From this point, the dot product is calculated between the receptive field in
the first matrix and the first matrix in the first filter, between the receptive field
in the second matrix and the second matrix in the first filter, between the receptive
field in the third matrix and the third matrix in the first filter, and so on. Then the average
of all these values is computed and placed in the top left position of the first resulting
matrix. Then the rest of the values in the first matrix are computed by sliding the receptive
field across the matrices, computing the dot product, computing the average value, and
placing it in the corresponding position in the first resulting matrix. So, the first
resulting matrix was generated with the first filter. This process is repeated with each
of the filters, until there are 32 resulting matrices. Essentially, 2d convolution between
a matrix and a filter is repeated to account for the depth of both the tensor and the filter,
the average of these values is computed to make the result a 2d matrix, and this process
is repeated with each of the filters. By performing convolution and averaging the
values in this way, I combine the highlighted features in a computationally inexpensive
way. By doing this, we add a level of complexity that will force the model to make progressively
more and more intricate feature recognitions or highlights as it finds the optimum filter
values deeper in the convolutional neural network. So, deep in a CNN, where convolution,
pooling, and activation have already been performed quite a few times, a filter that looks like
a horizontal line, for example, will likely result in the highlighting of a much more
complex feature in the original image. Generally, by the last convolutional layer, each filter
will represent a relatively complex feature. I say “relatively” because depending on
the scenario a curved edge could be a complex feature, like in our case, or an entire human
face could be a complex feature. While all this may be slightly complicated, it is important
to remember that at a high level, we are still just computing a product between an input
and a weight, and because of this, eventually allowing the neural net to find the relationship
between input and output. Then, you flatten all the pixels into a single
vector. Remember, a vector in programming is a one-dimensional array. You connect each
pixel to a neuron in either another hidden layer or the output layer. Remember that
by this point in the convolutional neural network, a pixel should represent the presence
or absence of a relatively complex feature. In this way, we add a fully connected section
to our neural network. This is done in order to take advantage of all the different features
that have been learned. In other words, the section of a Convolutional Neural Network
with convolution, pooling, and activation performs feature extraction and in the fully
connected section, the network finds the relationship between the presence or absence of these features
and the different classes. Like in the last video, with breast cancer diagnosis, the output
will be a set of probabilities. These probabilities represent the chances that a particular image
belongs to a certain class, according to the model. You can then take the digit corresponding
to the highest probability and voila, you have a predicted digit. Initially, as with
the neural networks dealt with in previous videos, the predictions are inaccurate. However,
accuracy is gradually improved by minimizing the loss or cost during training.

For training the model, I use the MNIST dataset.
This dataset is popular and widely used among programmers. It contains 70000 handwritten
digits that are already split into training and testing data, with 60000 training digits
and 10000 testing digits. Each digit is a 28 by 28 pixel, black and white image. Each
of these images comes with a corresponding correct, human-made label.

Let’s begin writing the CNN in Python with
the help of Keras. First, as always, I import all the packages that I will need in order
to implement the CNN. Then, I load the MNIST dataset. From where? Actually, Keras has a
copy of it because of how popular it is. Also, as I previously said, the dataset is already
split into training and testing data, so all I have to do is assign it to X_train, X_test,
y_train, and y_test. Remember that X_train and X_test are the images, and y_train and
y_test are the classes or labels for the images. Next, I reshape both X_train and X_test to
-1 by 28 by 28 by 1. What in the world does that mean? Well, 28 by 28 is simply the resolution
of each of the images, and 1 represents the single color channel. The -1 is a special
number that essentially tells Keras to figure out what the actual value is. The actual value
will be the number of images assigned to that variable. So, the only thing that will actually
change in this line is the addition of the one color channel, and this is added to match
the format expected by Keras. Next, I apply to_categorical to y_train and y_test. For
a complete description and explanation of what this does, see my breast cancer diagnosis
video. Basically, it converts the classes in y_train and y_test into a form readable
by Keras. In the following lines, I normalize the input data by taking each pixel intensity
value, which is currently between 0 and 255, and dividing by 255. In order to do this,
I must first change the type of the values in these numpy arrays to float32 so that they
can contain decimals. Then, as always, I define the model in the line model = Sequential().
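The preprocessing steps above can be tried out without downloading anything. The sketch below uses only NumPy and a few random stand-in images instead of the real MNIST data. The helper names `preprocess_images` and `one_hot` are hypothetical, and `one_hot` is a hand-written stand-in that produces the same kind of output as Keras's to_categorical.

```python
import numpy as np

def preprocess_images(images):
    """Add the single color channel, then scale pixel values to the 0-1 range."""
    # -1 lets NumPy work out the number of images from the array's total size.
    images = images.reshape(-1, 28, 28, 1)
    return images.astype("float32") / 255

def one_hot(labels, num_classes=10):
    """Turn a label like 3 into a row with a 1 in position 3 and 0s elsewhere."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Five random 28 by 28 "images" with integer pixel values from 0 to 255.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28)).astype(np.uint8)
fake_labels = np.array([3, 0, 9, 1, 3])

X = preprocess_images(fake_images)
y = one_hot(fake_labels)

print(X.shape, X.dtype)  # (5, 28, 28, 1) float32
print(y.shape)           # (5, 10)
```

Note that the -1 does not flatten anything here: the images stay 28 by 28, and only the trailing dimension of size 1 is added.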
From here, I define the input shape as 28 by 28 by 1. Note that since we only input
one image at a time, the input is 3 dimensional, whereas X_train and X_test are 4 dimensional,
with the extra dimension being the image index. In the same line, I define 32 filters of size 3 by 3,
and even though I don’t explicitly write it out, Keras includes a bias since this is
the default. I also choose not to change the default of no padding,
simply because it is not necessary. Next I add a pooling layer, and more specifically,
a max-pooling layer with a pool_size of 2 by 2. There are actually other types of pooling,
like average pooling, for example, in which instead of taking the maximum value in a defined
area, you take the average value. In truth, there isn’t all that much of a difference
in the resulting performance between the two, especially for a simpler example, like the
one that we are dealing with. As mentioned earlier, you can change the stride length
in both pooling and convolution from the default, however, this is unnecessary. Hence, I use
the default values. Then, I apply the relu activation function. After this, I flatten
all the matrices into one very long list of values, where each value represents a pixel.
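It can help to follow the sizes through these layers by hand. The short calculation below is a sketch with hypothetical helper names (`conv_output_size`, `pool_output_size`). It assumes the settings described above: a 28 by 28 by 1 input, 32 filters of size 3 by 3 with a bias each, no padding, default strides, and 2 by 2 max-pooling.

```python
def conv_output_size(size, filter_size, stride=1, padding=0):
    """Width (or height) of the matrix that convolution produces."""
    return (size + 2 * padding - filter_size) // stride + 1

def pool_output_size(size, pool_size, stride=None):
    """Width (or height) after pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (size - pool_size) // stride + 1

num_filters = 32

after_conv = conv_output_size(28, 3)          # 26
after_pool = pool_output_size(after_conv, 2)  # 13
flattened_length = after_pool * after_pool * num_filters  # 5408

# Each 3x3x1 filter has 9 weights plus 1 bias.
conv_parameters = (3 * 3 * 1 + 1) * num_filters  # 320

print(after_conv, after_pool, flattened_length, conv_parameters)
# 26 13 5408 320
```

So the "very long list of values" produced by flattening has 5408 entries, which is what gets connected to the next layer.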
Then, I add a dense layer with 128 neurons. So, the value of each pixel will be connected
to each of the 128 neurons in this layer. This connection is often computationally expensive,
especially with much higher-resolution images. I also apply the relu activation function
to each of these neurons. Finally, I add an output layer with 10 different neurons. Each
of these neurons will represent a class, or, more specifically, a digit from 0 to 9. I
apply the softmax activation function, which is used in multiclass classification problems
like this one, as discussed in the last video. Finally, I compile the model. I use categorical
crossentropy as the loss, which is used in tandem with the softmax activation function,
I use the Adam optimizer, which will dictate how the weights are updated during training,
and I state that I want to keep track of the accuracy metric. Now we have defined the structure
of our convolutional neural network. Next, I perform training in the line model.fit.
I pass X_train and y_train_cat, and I define the batch size, epochs, verbose, and validation
split. For a complete explanation of all of these, see my neural network regression video.
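Putting the pieces together, the model might look roughly like the sketch below. The layer order follows the description above: convolution, max-pooling, relu, flatten, a 128-neuron dense layer, and a 10-neuron softmax output. The `tensorflow.keras` import path, the helper name `build_model`, and the exact layer arguments are assumptions and may differ from the code on GitHub. The batch size, epochs, and validation split are not stated here, so the training call is left out.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    """Build and compile a small CNN for 28x28 single-channel digit images."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3)),             # 32 filters, bias on, no padding
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Activation("relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),  # one neuron per digit, 0 to 9
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

Most of the model's parameters sit in the connection between the 5408 flattened values and the 128 dense neurons, which is why that connection is the computationally expensive part.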
Finally, I evaluate the model and find that the accuracy is good. And that’s it. I have
now written a very simple Convolutional neural network in Keras. With this simple dataset,
there is very little that I can do to improve the model’s performance, let alone by a
substantial amount. For example, I could add more layers to the model, or I could implement
dropout, however, none of this is necessary and again, it doesn’t have a substantial
effect on the model’s accuracy, which is already quite high.

The next section of the code is responsible
for the GUI, or graphical user interface, that you saw at the start of this video, and
for the operations performed between the numbers. Basically, here’s how it works. First, the
program opens a window, into which the user can write with their mouse. Whenever the user
clicks, holds, and drags, the program will plot circles in a location corresponding to
the position of their mouse. The program keeps track of every location in which a circle
was plotted, so that an exact copy of the displayed image can be made, that will be
inputted to the convolutional neural network, when the user clicks ‘save image’. Before
being inputted, however, the image is resized to 28 by 28, or the dimensions of images in
the MNIST dataset. The program also makes the pixel intensity values float32 variables
and divides each of them by 255, just as we did when we trained the network. Then this
slightly-modified image is inputted to the network and the program takes the digit with
the highest probability in the network’s output. This digit is then displayed. If the
user clicks on the ‘click here if the number is incorrect’ button, then the program will
clear the current screen and will later replace the digit with a new one, when entered by
the user. A user at this point can either enter the second digit in a number, in which
case the process will be repeated and the two digits will be joined together, or the
user can click an operation. When the user clicks an operation, the program stores the
name of that operation in a variable. After this, the initial process is repeated and
the user can enter another number. When they click equals, the program simply performs
the requested operation between the two numbers and prints the result. If the user happens
to click the ‘reset’ button at any point in this process, then all the existing information
will be cleared or overwritten, and the process will start over again. And that’s it. If
you want to try the program out for yourself you can download the code from my GitHub, which
is linked in this video’s description. Note that if you do so, you will
need to install pillow in your virtual environment. This is done with the following command. And
that’s it! We now have a basic, functional calculator that can recognize handwritten
digits and perform simple operations between them.

In case you didn’t already know, this video
is part of a series in which I find really cool datasets like this one, and for each
of them I show you how to implement a Neural Network. In doing this, I hope to both entertain
and educate you. Subscribe and click on the notification bell to be notified when I release
a new video, and also hit the like button and leave a comment if you want this video
to reach more people. Thanks for watching.

Author: Kevin Mason

32 thoughts on “Understanding Convolutional Neural Networks: Making a Handwritten Digit Calculator | Keras #5”

  1. Is tensorflow only work in 64bit python ? I have conda installed(64) in my 64bit win 10 but python is running there is 32bit and i am getting err while doing pip install tensorflow.

  2. Minor corrections: (1) Padding is performed before convolution, not after, as I made it seem at 2:57. (2) At 1:40, I say that tensors are arrays with dimensions that are greater than 2. This is not entirely correct. Tensors can be 0-dimensional, 1-dimensional, and 2-dimensional, although tensors of these dimensions are more often called scalars, vectors, and matrices, respectively. Beyond 2 dimensions, 3-dimensional arrays are called 3rd order tensors, 4-dimensional arrays are called 4th order tensors, and so on.

  3. Hello, I am also young and invested in AI. I am creating a discord community of like-minded individuals to share ideas. Your knowledge would be very useful. If you are interested in joining please message me on discord: CaptainAdd#6203

  4. awsome video , can you explain by any chance what happens when the first conv layer has 32 filters and the next one 64 for example? Most ppl do that

  5. 15:27 Isn't -1 just used basically to "unravel" the image matrix into one long vector as the matrix is represented as nested arrays in python? This was always my understanding of the use of -1 . If someone can correct me that'd be great.
