Recurrent Neural Network – The Math of Intelligence (Week 5)

Hello world, it's Siraj, and today we're going to generate words. Given some book, movie script, or any kind of text corpus (it's plug and play, so you can give it any kind of text corpus), it will learn how to generate words in the style of that corpus. In this case we're going to give it a book: Metamorphosis by Franz Kafka, who was a really crazy, weird writer from the 20th century. Cool dude. We're going to generate words in the style of that book, and this can be applied to any type of text. It doesn't just have to be words; it can be code, it can be HTML, whatever. No libraries, just numpy, so I'm going to go through the derivation, the forward propagation, calculating the loss, all the math. Get ready for some math: put on your linear algebra and calculus hats.

This is kind of what it looks like in this first image, and I'm going to actually code it as well. I'm not just going to glaze over it; I'm going to code the important parts so we can see the outputs as I go. So check it out: given some text corpus, it will predict the next character. What you're seeing here is it predicting the next word, but we're going to do a character-level recurrent network, which means it's going to generate character by character by character, not word by word by word. It's going to be trained for a thousand iterations, and the more you train it the better it gets, so if you leave this thing running overnight on your laptop, by the time you wake up it'll be really good. However, I wouldn't recommend training it on your laptop; as my song says, I train my models in the cloud now, because my laptop takes too long. So what is a recurrent network? What is this thing?
We've talked about feed-forward networks. I've got two images here of feed-forward networks. The first is the most popular image, that really funky-looking neuronal architecture, but it can be kind of confusing if you think about it, because it's not like these neurons are classes and each class has links to all the other neurons, some massive, crazy linked-list or tree-like thing. It's not really like that. What's really happening is a series of matrix operations: these neurons are actually just numbers that we then activate with an activation function. A better, more mathematically sound way of looking at it is as a computation graph. If you have some input (and the input could be anything), you multiply the input by the weight matrix, add a bias value, and then activate the result, and that's your output, which you then feed into the next layer. A layer, what you see as these neurons, is actually just the result of a dot product operation followed by adding a bias value. You should add a bias in practice; I've built neural networks without biases before for examples, but you really should add one, and I'll talk about why in a second. Then you activate the output, and by activate I mean you take the output of that dot-product-plus-bias operation and feed it into an activation function, a non-linearity, whether that's a sigmoid, tanh, or rectified linear unit.
The reason we do that is so that our network can learn both linear and nonlinear functions. Neural networks are universal function approximators; if we didn't apply an activation function, they would only be able to learn linear functions. We want to learn both nonlinear and linear functions, and that's why we apply an activation function. A great way to remember this whole thing is to just rap about it: input times weight, add a bias, activate, repeat. Here we go, sing with me: input times weight, add a bias, activate, repeat. And you just do that for every layer.

So feed-forward networks are great for learning an input-output pattern: what is the rule between a set of inputs and a set of outputs? In the end, a feed-forward network, and in fact every neural network, is just one big composite function. What do I mean by that? You can think of a neural network as a giant function, and inside of that network are smaller, nested functions; a composite function is a function that consists of other functions. Remember the computation graph we just looked at: each layer is a function (input times weights, add a bias, activate) whose output you feed as the input to the next function. The most nested function, right in the middle, is the first layer, whose output we feed to the next layer, which is the next function, whose output we feed to the next layer, and so on. The outermost function is the output layer, because we're feeding it the output of all that chain of computation that already occurred.
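That single layer step (input times weight, add a bias, activate) can be sketched in a few lines of numpy. The sizes here (3 inputs, 4 neurons) are arbitrary placeholders, not anything from the video:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3)             # input vector
W = np.random.randn(4, 3) * 0.01   # weight matrix (4 neurons, 3 inputs)
b = np.zeros(4)                    # bias vector

# "input times weight, add a bias, activate"
h = np.tanh(np.dot(W, x) + b)
print(h.shape)  # (4,)
```

Stacking several of these, each feeding its output into the next, is exactly the composite function described above.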
So that's what that is: a composite function. We would use feed-forward nets anytime we have two variables that are related: temperature and location, height and weight, car speed and brand. These are all mappings. But what if the ordering of the data mattered? What if you had stock prices? It's a very controversial topic, and I don't personally care much about finance data, but I know some of you guys do, and I'll probably talk about it more in the future. Anyway, tangent over, back to this: what if time matters? Stock prices are a great example of when time matters. You can't just predict a mapping between time and the price; what the stock prices were before is what matters to the current stock price, at least with the data available in the context of stock prices. And that applies to all time-series data. Take video: if you want to generate the next frame in a video, it matters what frames came before it. You can't just learn a mapping between a frame and the time that frame shows up, because then, given some new time, you'd generate a frame based on nothing else; it depends on the frames that came before it. You see what I'm saying: the sequence matters. Same with the alphabet or the lyrics of a song: you can't just generate a letter or a lyric depending on the index it's at, you've got to know what came before it. To get into neuroscience for a second, try to recite the alphabet backwards. It's hard, right? Z, Y, X, W...
See, I can't even do it right now, and I'm not going to edit that out. Or try to recite any song backwards: you can't, because you learned it in a sequence. It's a kind of conditional memory; what you remember depends on what you've stored previously. And that is what recurrent networks help us do: they help us compute conditional memories, the next value in a sequence of values. That's what recurrent networks are good at; that's what they're made for. And it's not like this is some new technology. Recurrent networks were invented in the 80s, and neural networks in the 50s. So why is this super hot right now? Why are you watching this video? Because with the arrival of bigger data and bigger computing power, when you take recurrent networks and give them those two things, they blow almost every other machine learning model out of the water in terms of accuracy. It's just incredible. Anyway, this is a picture of a three-layer recurrent network. You've got your input layer, your hidden state, and your output layer, and so far that would just be a feed-forward network. But the difference here is that we've got this other piece right here, and it's not actually another layer: the difference is that we've added a third weight matrix.
So we've got our first weight matrix, our second weight matrix, and now a third weight matrix, and that's really what makes it different from a feed-forward network. What the third weight matrix does is connect the hidden state at the current time step to the hidden state at the previous time step; it's the recurrent weight matrix, and you'll see programmatically and mathematically what I'm talking about. That's really the key bit for recurrent networks; that's what makes them unique compared to feed-forward networks. When we train a network, we continuously feed it new data points from our training set, data point after data point. For feed-forward networks we only feed in the input, input after input, and the hidden state is updated at every time step, but we never feed the previous hidden state back in. Because we want to remember a sequence of data, we're going to feed in not just the current data point but also the previous hidden state, and by that I mean the values computed for that hidden state at the previous time step, that matrix of numbers. So you might be thinking: wait a second, why don't we just feed in the input plus the previous input from the previous time step? Why the input and the previous hidden state? Because input recurrence only remembers what just happened, that one previous input, but if you feed in the previous hidden state, it can remember the whole sequence. It's not just about what came right before; it can remember everything that came before it, because you can think of that hidden state as a kind of clay that's being molded by every new input.
It's being molded by every new input, and by feeding that clay back into the network, it's learning a neural memory. It's a form of neural memory, conditional memory, and it can remember sequential data. Here's another example, a very popular type of image for recurrent networks. We're feeding in the current input, calculating a hidden state, and then computing an output, and for the next time step we give it the new data point as well. The blue arrow is what's different here compared to a feed-forward network: we're feeding in the previous hidden state along with the input to compute the current hidden state, which we use to compute our output, our y value, and we're using a loss function to improve our network every time. If you think of what that recurrence looks like, remember the feed-forward network we just looked at: the difference is that we feed the output of the hidden state, the output of that weights-times-input-plus-bias-then-activate operation, back into the input. So the formula for a recurrent network says that the current hidden state h_t is a function of the previous hidden state and the current input, and the theta value here represents the parameters of the function. The network learns to use h_t as a lossy summary of the task-relevant aspects of the past sequence of inputs up to t. The loss function we're going to use here is the negative log likelihood, okay?
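A minimal numpy sketch of that recurrence, h_t = f(h_{t-1}, x_t): the same hidden state vector is fed back in at every step, molded by each new input. The sizes and the random inputs are placeholders for illustration:

```python
import numpy as np

np.random.seed(0)
input_size, hidden_size = 3, 5
Wxh = np.random.randn(hidden_size, input_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (the recurrent matrix)
bh = np.zeros((hidden_size, 1))                         # hidden bias

h = np.zeros((hidden_size, 1))  # initial hidden state
for x in [np.random.randn(input_size, 1) for _ in range(4)]:
    # the previous h is fed back in at every step: this is the
    # "clay being molded by every new input"
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
print(h.shape)  # (5, 1)
```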
This is a very popular loss function for plain old recurrent networks, ones not using anything fancy like long short-term memory cells or bidirectional capabilities. The negative log likelihood usually gives the best accuracy for plain recurrent networks, which is why we're going to use it, and we'll talk about what it consists of in a second. Our steps are going to be: first, initialize our weights randomly, like we always do. Then give the model a char pair. What is a char pair? It's the input char, some seed, some letter from the training text that we give as input, together with the target char, and the target char is our label. Our label is actually the next char. So if we take the first two chars from some input text from some corpus, say the word is "the", the input char would be "t" and the target char would be "h". So given "t",
we want to predict "h". You see how that target char acts as the label we're trying to predict. From our forward pass we're going to calculate the probability of every possible next char given that "t", according to the current state of the model's parameters, and then we're going to measure our error as a distance between that probability distribution and the target char, the next char in the sequence, our label. And we just keep doing that; it's a dynamic error. Once we have that error value, we'll use it to calculate the gradients for each of our parameters, to see the impact they have on the loss, and that is backpropagation through time. We call it "through time" because we're also applying it to that hidden-state-to-hidden-state recurrent matrix; otherwise it's just the same thing, it's just backpropagation. Once we have our gradient values, we update all the parameters in the direction, via our gradients, that minimizes the loss, and we just keep repeating that process. So everything is the same here as for a feed-forward network: gradient descent, calculating an error value, a forward pass. The difference is that we're connecting the current hidden state to the previous hidden state, and that changes how the network learns. So what are some use cases? I talked about time-series prediction: weather forecasting, stock prices, traffic volume. Also sequential data generation: music, video, audio, any kind of sequential data. What is the next note, the next audio waveform, the next frame in the video? And then some other examples:
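The char-pair setup described above (input char "t", target char "h" for the word "the") can be sketched as:

```python
# Each input char is paired with the next char in the text as its
# target label, using "the" as in the example above.
text = "the"
inputs = list(text[:-1])   # ['t', 'h']
targets = list(text[1:])   # ['h', 'e']
pairs = list(zip(inputs, targets))
print(pairs)  # [('t', 'h'), ('h', 'e')]
```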
I've got a great one here on binary addition that was originally written by Trask, who is a great technical writer; definitely check that out. And once we understand the intuition behind recurrent networks, we can move on to LSTM networks, bidirectional networks, and recursive networks. Those are more advanced networks that solve some problems with recurrent networks, but before you get there, you've got to understand recurrent networks. Okay, so this code contains four parts. The first part is to load the training data, then we'll define our network, then we'll define our loss function. The loss function is going to contain both the forward pass and the backward pass, so the real meat of the code happens in the loss function: it's going to return the gradient values that we can then use to update our weights later on during training. Once we've defined that, we'll write a function to make predictions, which in this case means generating words, and we'll train the network as well. Okay, so our first step is to load up our training data. Let's look at what that data is: if we open this file, kafka.txt, it begins "One morning, when Gregor Samsa woke from troubled dreams..." Right?
So this is just a book, a big book, a big .txt file. That's our input. We'll open it up using Python's native functions and read that plain txt file, and then we'll get the list of unique characters, which we'll store in chars. We'll define how big our data is as well as our vocab size: the data size is the length of that big text file, and the vocab size is the length of chars, how many unique chars we have. We'll print both out for ourselves just so we know what we're dealing with, and it's going to tell us how many unique chars there are, which matters to us because we want to make a vector of the size of the number of unique chars. So let me go ahead and print that out, and it's going to tell us exactly what the deal is.
Okay, so the data has about 137,000 characters, and 81 of them are unique. Good to know. Our next step is to use that vocab size, because we want to be able to feed vectors into our network. We can't just feed in raw strings, raw chars; we've got to convert the chars to vectors, and in the context of machine learning a vector is just a list of numbers, float values in this case. So we're going to create two dictionaries: one converts characters to integers (that's the one I've just written), and the next converts integers back to characters. Once we've done that, we can print all the values they're storing, because these are the dictionaries we're going to use in a second to convert our values into vectors. (What's the deal here? Oh, "enumerate" was misspelled. Great.) Right, so here are our dictionaries, one from characters to integers and one from integers to characters. Now that we have those, let's create a vector for the character "a"; this is what vectorization looks like for us. We'll initialize the vector as empty, a vector of zeros of the size of the vocab, and then we'll do the conversion.
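The two lookup dictionaries can be sketched like this. I'm using a short stand-in string here rather than the Kafka text, but the construction is the same:

```python
# Stand-in for the contents of the text file.
data = "hello world"
chars = list(set(data))                 # unique characters
data_size, vocab_size = len(data), len(chars)

# character -> integer, and integer -> character
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

print(data_size, vocab_size)
```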
We'll look up char-to-integer for "a", which gives us an integer value, and we'll set that element to one. So when we print out this vector (after remembering to import numpy, which I forgot), it's a vector of size 81, because there were 81 unique characters, and all of the elements in the vector are 0 except for the one at the index that "a" maps to in that dictionary. That's how we map it, and that's why we created those two dictionaries. This is what we'd feed in as "a", and remember I said we have a char pair, so we'd feed in two of these: "a" and whatever the next character is, one as our input and one as our label, the next character.
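Here's a sketch of that one-hot vectorization with a tiny stand-in vocabulary (the real one has 81 characters):

```python
import numpy as np

# Tiny stand-in vocab; the real script builds char_to_ix from the book.
vocab = ['a', 'b', 'c', 'd']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

# A zero vector of vocab size with a single 1 at the character's index.
vec = np.zeros(len(vocab))
vec[char_to_ix['a']] = 1
print(vec)  # [1. 0. 0. 0.]
```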
Next, our model parameters. We're going to define our network; remember, it's a 3-layer network. We have our input layer, our hidden layer, and our output layer, and all these layers are fully connected, which means every value in one layer is connected to every value in the next. First, though, let's define our hyperparameters, the tuning knobs for the network. We'll say our network has 100 neurons for its hidden layer, and 25 characters processed at every step (that's our sequence length), and our learning rate is going to be a very small number, because if it's too low, learning is very slow, but if it's too high, it will overshoot and never converge. The learning rate, by the way, is how quickly a network abandons old beliefs for new ones. If you've been training your neural network on cat pictures and you give it a new dog picture, the lower the learning rate, the more likely it is to consider that dog picture just an anomaly and kind of discard it internally, rather than immediately treat it as representative of the training data; a higher rate makes it adjust faster. So it's a way to tune how quickly a network abandons old beliefs for new ones. Anyway, those are our hyperparameters. Now we can define our network's weight values. The first set of weights goes from our input to our hidden state: x is our input, so this is W_xh in the terminology.
That matrix is initialized randomly using numpy's randn function, with dimensions given by the hidden size we've defined and the vocab size, because those are the two values we're dealing with here, and we multiply it by 0.01 because we just want to scale it down for a character-level recurrent network. So that's input to hidden state. Then we repeat that process, but this time not from input to hidden but from our hidden state to the next hidden state; that's our recurrent weight matrix right there. Lastly, we have our third weight matrix, from our hidden state to our output, so its dimensions are between the vocab size and the hidden size. Then, since we have two biases, the bias for the hidden state is initialized as a set of zeros of size hidden_size, because it's for our hidden state, and one more bias for our output, also a collection of zeros; the difference is that this one is of size vocab_size. (Let's see what we got here: hidden_size is not defined... a typo... there, it runs.) Great. So next, the loss function. It's going to take as input a list of input chars, a list of target chars, and the previous hidden state, and it's going to output a loss, a gradient for each parameter between layers, and the last hidden state. So what is the forward pass? The forward pass in a recurrent network looks like this; this function describes how the hidden state is calculated.
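Putting the hyperparameters and all five parameter tensors together, the setup described above looks like this in numpy. vocab_size is hard-coded to 81 here (the count from the book) so the snippet stands alone:

```python
import numpy as np

# hyperparameters (the "tuning knobs")
vocab_size = 81     # unique characters in the text
hidden_size = 100   # neurons in the hidden layer
seq_length = 25     # characters fed in per training step
learning_rate = 1e-1

# model parameters, scaled down by 0.01
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (recurrent)
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))                         # hidden bias
by = np.zeros((vocab_size, 1))                          # output bias
```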
So how is the forward pass calculated? Remember, it's just a series of matrix operations; this is basically our forward pass right here, what you're looking at. This first equation (let me make this smaller so you can see) is the forward pass: the dot product between the input-to-hidden weight matrix and the input data, that's this term right here, plus the dot product between the hidden-to-hidden matrix and the previous hidden state, and then we add the hidden bias. That gives us the hidden state value at the current time step; that's what that represents. Then we take that value and compute a dot product with the next weight matrix, hidden state to output, and add the output bias, and that gives us the unnormalized log probabilities for the next chars, which we then squash into probability values using this function p (which is actually right here; I'll talk about it in a second). So that's our forward pass. Before I talk about the backward pass, let's talk about the loss for a second. The loss is the negative log likelihood, the negative log value of p, and p is this function here, represented programmatically by this right here: e to the x, where x is the output value it received, divided by the sum of e to the power of all the output values. That gives us a p value, and we take the negative log of that p value, and that is our loss, a scalar value.
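One forward step, as just described (hidden state, unnormalized log probabilities, then the softmax squashing), sketched with tiny placeholder sizes so it runs on its own:

```python
import numpy as np

np.random.seed(1)
vocab_size, hidden_size = 5, 8
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(vocab_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

x = np.zeros((vocab_size, 1))
x[2] = 1                            # one-hot input char
h_prev = np.zeros((hidden_size, 1)) # previous hidden state

h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev) + bh)  # current hidden state
y = np.dot(Why, h) + by                                  # unnormalized log probs
p = np.exp(y) / np.sum(np.exp(y))                        # softmax probabilities
loss = -np.log(p[3, 0])                                  # negative log likelihood of a target char
```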
Once we have that loss, we're going to perform backpropagation using it, and the way we compute backpropagation, to go over this, is by using the chain rule. The chain rule is from calculus; what we want to do is compute a gradient for each of the layers, for each of the weight matrices. Given an error value, we compute the partial derivative of the error with respect to each weight, recursively. The reason we use the chain rule is that we have three weight matrices: input-to-hidden, hidden-to-output, and hidden-to-hidden, and we want to compute gradient values for all three of them. The way we'll do that is to compute our loss using the negative log likelihood and use that loss to compute the partial derivative with respect to each of the weight matrices. Once we have those, that's our gradient value, the change, the delta, and we can then update all three weight matrices at once, and we just keep doing that over and over again. Our first gradient of the loss is computed using this function: take the probabilities and subtract one at the target index (p minus 1). That gives us our first gradient, and we're going to use the chain rule to backward-pass that gradient into each weight matrix. So let me talk about what I mean by this. Remember how I said neural networks are giant composite functions?
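Before going further, here's a numeric sanity check of the chain rule itself, using f(x) = (3x + 1)^5 as the example, whose derivative by the chain rule is 5(3x + 1)^4 times 3:

```python
def f(x):
    # composite function: outer g(u) = u^5, inner h(x) = 3x + 1
    return (3 * x + 1) ** 5

def f_prime(x):
    # chain rule: derivative of the outer function times derivative of the inner
    return 5 * (3 * x + 1) ** 4 * 3

# compare against a central finite-difference approximation at x = 2
x, eps = 2.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(numeric - f_prime(x)) / f_prime(x) < 1e-6
```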
A neural network is a giant composite function, and what the chain rule lets us do is compute the derivative of a function as the product of the derivatives of its nested functions. The chain rule: if f(x) is a composite function that consists of g(h(x)), then the derivative of f is the derivative of g evaluated at h(x), times the derivative of the nested function h(x). You multiply by the derivative of the inside function, and that gives you the derivative of the bigger function, and you keep doing that for as many nested functions as you have. Here's another example: if I want to differentiate (3x + 1)^5, then this is actually a composite function whose outer function g is something raised to the fifth power. Using the power rule, we take the exponent, move it to the coefficient, and subtract one from the exponent, giving 5(3x + 1)^4, times the derivative of the nested function 3x + 1, which is 3. That's the chain rule, and multiplying those two derivatives together gives us the derivative of the larger function f(x). That same logic applies to neural networks, because neural networks are composite functions. We recursively move this partial derivative value backward, and by "move" I mean we compute the dot product between the partial derivative calculated at the later layer and every layer, recursively, going backward. This will make more sense as we look at it programmatically, but that's what's happening here. So let's code this up. By the way, the reason we add a bias is that it allows you to shift the function; think of it like the b in the equation y = mx + b. It
allows you to move the line up and down to better fit the data. Without b, the line always goes through the origin (0, 0), and you might get a poorer fit; a bias is kind of like an anchor value. Anyway, to define our loss function, we're going to give it our inputs and our targets as parameters, as well as the hidden state from the previous time step. Then let's define the containers we're going to store values in at every time step as we compute them: four empty dictionaries. xs will store the one-hot encoded input characters for each of the 25 time steps; hs will store the hidden state outputs; ys will store the output values; and ps will take the ys and convert them to normalized probabilities for chars. Then hs[-1] is set to a copy of the previous hidden state, and here's why we copy it, so check this out: using the equals sign would just create a reference, but we want a whole separate array. We don't want hs[-1] to automatically change if h_prev changes, so we create an entirely new copy of it. Then we initialize our loss as 0 (that's our loss scalar value), and then we go ahead and do the forward pass. So the forward pass is going to look like this: we've already looked at it mathematically, and now we can look at it programmatically. We'll say: for each value t in the range of the length of the inputs, compute a forward step.
So the forward pass is going to be: we start off with that one-of-k representation. We place a zero vector as the t-th input, and then inside that t-th input we use the integer from the inputs list to set the correct element. OK, so that's that second line. Once we have that, we compute the hidden state; remember, I showed you the equation before, and we just repeat that equation here. Then we compute our output just like I showed before, and then the probabilities for the next chars. Once we have our probabilities, we compute our softmax cross-entropy loss, which is the negative log likelihood; it's also called the cross entropy, and you'll actually see cross entropy as a predefined function in TensorFlow, but we're computing it by hand here.

And so once we have the forward pass, we can compute the backward pass, where we compute the gradient values going backward. So we initialize empty arrays for these gradient values (the gradients are the derivatives; the derivatives are our gradients, it's the same thing). We're computing our derivatives with respect to our weight matrices, from x to h, from h to h, and from h to y, and we'll initialize them as zeros. We also want to compute partial derivatives, or gradients, for the bias values of our hidden state and our output, as well as for the hidden state at the next time step. When we do backpropagation, we're going to collect our output probabilities and then derive our first gradient value.
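Putting the forward pass above into code, here is a self-contained sketch with toy sizes (the weight names match the video's convention; the toy dimensions, seed, and example sequence are mine):

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 5, 8   # toy sizes; the video uses hidden_size = 100
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

inputs = [0, 2, 4]      # character indices
targets = [2, 4, 1]     # the same sequence shifted by one
xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.zeros((hidden_size, 1))
loss = 0.0
for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))   # one-of-k ("one-hot") input vector
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh)  # hidden state
    ys[t] = np.dot(Why, hs[t]) + by                                    # unnormalized scores
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))                      # softmax probabilities
    loss += -np.log(ps[t][targets[t], 0])   # cross-entropy (negative log-likelihood)

print(loss)   # near len(inputs) * ln(vocab_size) for untrained weights
```

With untrained near-zero weights the probabilities are almost uniform, so the loss starts out close to ln(vocab_size) per character, which is why the first samples look like random noise.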
This is how we compute our first gradient value, the gradient with respect to our loss. We compute the output gradient, which is the dot product of the output error and the transpose of the hidden state. So check this out: this right here is the partial derivative for our hidden-state-to-output matrix, h to y. We compute the dot product between that output error and the transpose of the hidden state, and the reason we use a transpose is that we can think of this, intuitively, as moving the error backward through the network, giving us some measure of the error at the output of that layer. When we compute the dot product between the transpose of some layer's matrix and the derivative of the next layer, that is moving the error backward. It's backpropagation: the error value is constantly changing with respect to every layer it moves through, and multiplying the partial derivative from the previous layer by the transpose of the layer where we currently are outputs a gradient value, a derivative, and we'll use that derivative later on to update other values as well.

We also compute the derivative of the output bias, and then we backpropagate into h. Notice how we are continuously performing dot product operations for every single layer we have. We also backpropagate through the tanh nonlinearity, so we compute its derivative value, and this is programmatically what the derivative of tanh looks like. As we move from the tail end of the network back toward the beginning, we keep using the derivatives computed at the previous layers as inputs to each new dot product.
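Programmatically, that backward pass might look like the following sketch (weight names as in the video; the toy sizes and the mini forward pass are mine, included only so the snippet runs standalone; the clip loop at the end is the exploding-gradient safeguard the video touches on next):

```python
import numpy as np

np.random.seed(1)
V, H = 4, 6      # toy vocab and hidden sizes
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01
bh, by = np.zeros((H, 1)), np.zeros((V, 1))
inputs, targets = [0, 1, 3], [1, 3, 2]

# Forward pass (as before), caching the values the backward pass needs.
xs, hs, ps = {}, {-1: np.zeros((H, 1))}, {}
for t in range(len(inputs)):
    xs[t] = np.zeros((V, 1)); xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh)
    y = np.dot(Why, hs[t]) + by
    ps[t] = np.exp(y) / np.sum(np.exp(y))

# Backward pass: apply the chain rule layer by layer, moving backward in time.
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])    # gradient flowing in from the next time step
for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1                      # softmax + cross-entropy gradient
    dWhy += np.dot(dy, hs[t].T)              # dot with a transpose: error moved backward
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext          # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh         # through tanh: tanh'(x) = 1 - tanh(x)^2
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t - 1].T)
    dhnext = np.dot(Whh.T, dhraw)
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)       # clip in place to mitigate exploding gradients
```

Note how every weight matrix picks up a gradient shaped exactly like itself, which is what lets the update step add them directly.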
The whole point of computing the dot product with respect to each of these layers is that we are computing new gradient values that we can then use to update our network later on. So we use that raw value to update our hidden gradient, and then lastly we compute the derivative of the input-to-hidden matrix as well as the derivative of the hidden-to-hidden matrix. Once we have that, we can return all of those derivatives, our gradient values, our change values.

There's also this step right here to mitigate exploding gradients, which we're not going to go into deeply, because it's not really necessary here. However, I will say this: whenever you have really, really long sequences of input data, like a huge book, then as the gradient moves backward (by "moving" I mean you're computing its dot product with the current weight matrix at every layer), it can blow up, or it can get smaller and smaller, which is the related problem in recurrent networks called the vanishing gradient problem. One way to deal with exploding gradients is to clip the values, defining some interval they're allowed to reach; the vanishing gradient problem is better addressed by LSTM networks, which we're not going to talk about here. Anyway, that's our forward and backward pass; we computed both inside the loss function, and we computed our loss as well, right here, using softmax cross entropy.

Then, for as many characters as we want to generate, we'll run the sampling step, and the forward pass there is just like we did before; it's the same exact thing.
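A sketch of that sampling loop (same weight names as before; the toy weights at the bottom are mine, just to show a call on an untrained net):

```python
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by):
    """Run the network forward n steps, feeding each sampled char back in."""
    V = Wxh.shape[1]
    x = np.zeros((V, 1)); x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)   # input times weight, activate
        p = np.exp(np.dot(Why, h) + by)
        p = p / np.sum(p)                                   # probabilities over chars
        # Draw from the distribution (rather than arg max) for varied output.
        ix = np.random.choice(range(V), p=p.ravel())
        x = np.zeros((V, 1)); x[ix] = 1                     # feed the pick back in
        ixes.append(int(ix))
    return ixes

np.random.seed(3)
V, H = 4, 5
Wxh = np.random.randn(H, V) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01
bh, by = np.zeros((H, 1)), np.zeros((V, 1))
ixes = sample(np.zeros((H, 1)), 0, 10, Wxh, Whh, Why, bh, by)
print(ixes)   # ten character indices, each in [0, V)
```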
It's just repeating the code over and over again: input times weight, activate, repeat. We get the probability values, draw a character from that probability distribution (the code uses np.random.choice rather than always taking the arg max, which keeps the generated text varied), create a one-hot vector for the predicted char, and add it to the list, and we just keep repeating that. n defines how many characters we want to generate, so we can generate as many characters as we want from a trained network, and we'll print those out.

OK, so then for the training part. We've completed the model code, right, but now for the training part we're going to feed the network some portion of the file, and then for the loss function we're going to do a forward pass to calculate all those parameters of the model for a given input-output pair, the current char and the next char, and we're going to do a backward pass to calculate all those gradient values. And then we're going to update the model using a type of gradient descent technique called Adagrad, which just decays the learning rate, but it's still gradient descent; you'll see what I'm talking about, it's not complicated, but it's called Adagrad.

So we're going to create big arrays of chars from the data file. The target array is going to be shifted from the input array, so basically just shifted by one, as you notice here. So now we have our inputs and our targets, and these numbers are actually character indices from the dictionary, but they help us create vectors, where each index marks the one among all the zeros in the zero vector, and that's what we feed into our model. So Adagrad is our gradient descent technique, and the difference here, as compared to regular gradient descent, is that we decay the learning rate over time, and what this does is it helps our network learn more efficiently. This is the equation for Adagrad, where "step size" means the same thing as learning rate.
The learning rate gets smaller and smaller during training because we introduce this memory variable, which grows over time, to calculate the step size. The reason it grows while the step size decreases is that it sits inside the denominator of this function right here; this is the programmatic representation of the mathematical equation you're looking at. So here's the programmatic implementation of that: we calculate this memory value, which accumulates the squared gradients of our parameters, and then we update our weight matrices, conditioned on the learning rate, which decays over time via this function right here.

So finally, we've done all the math, and now it's just implementing it. We have our weight matrices here, we have our memory variables for Adagrad, and then we will say: for a thousand iterations (well, actually a thousand times a hundred iterations), we want to feed the loss function our input vectors. To see how this part works: we feed the loss function our input vectors, it computes a forward pass, and it computes the loss as well. It returns the loss scalar, and it returns the derivatives, or gradients, with respect to all of those weight values that we want to update. And then we perform the parameter update using Adagrad, so we feed all those derivative values to our Adagrad function right here, and it updates our parameters. Basically, the learning rate just decays over time; that's why mem is calculated, to decay the learning rate over time, which just helps with convergence. There are different gradient descent techniques, like Adam, momentum, and others like that, but Adagrad is one of them.
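One Adagrad step, sketched on a single toy parameter (the variable names and toy values are mine; mem is the growing memory variable the video describes):

```python
import numpy as np

np.random.seed(2)
param = np.random.randn(3, 3)     # a toy weight matrix
dparam = np.random.randn(3, 3)    # its gradient from backprop
mem = np.zeros_like(param)        # Adagrad memory: running sum of squared gradients
learning_rate = 0.1

before = param.copy()
# The memory grows every step, and it sits in the denominator, so the
# effective step size (learning_rate / sqrt(mem)) decays over training.
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8)   # 1e-8 avoids divide-by-zero
print(np.abs(param - before).max())   # every element moves by at most learning_rate
```

Because each weight has its own memory entry, frequently-updated weights get smaller steps than rarely-updated ones, which is the per-parameter adaptivity that plain gradient descent lacks.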
And so once we do that, we can look at our sample function. Our sample function, right here, is going to generate 200-character samples at a time, for a thousand times a hundred iterations, so a lot of iterations: 100,000 iterations. OK, so let's go ahead and run this and see what happens. OK, see, the first iteration is really bad; look at that, it's just weird characters. But now it's got more human-readable characters; it's getting better. Notice how the loss is decreasing very rapidly here as well, and so, yeah, it's getting better over time.

OK, so that's it for our network. Let me stop this. And you can feed it anything, really; you can feed it any text file, and it's going to work with any text file, OK? So we've computed the forward pass and the backward pass, and the backward pass is just the chain rule, OK? I've got links to help you out in the description, but it's just the chain rule: we're continuously computing derivatives, or gradient values (partial derivatives and gradients are the same thing here; we call them partial because they're with respect to each of the weights in the network), going backward. And we are moving this error backward (by "moving" I mean we're computing the dot product of each layer's matrix with the derivative of the previous layer), just continually. And that's the chain rule. And if we do this, we can generate words; we can generate any type of text we want, given some text corpus. You can generate Wikipedia articles, you can generate fake news, you can generate anything really, code too.

And yeah, so also, for deep learning, you might be asking: to which of these layers do we add depth? Do we add
more layers between the input and the hidden state, between the hidden state and the output, or between the hidden state and the hidden state matrix? In which direction do I add deeper and deeper layers? Well, the answer is that it depends. This is one thing that's still being worked on, but the idea is that you'll get different results for whatever set of matrices you add deeper layers to. There are different papers on this, but yes, adding deeper layers is going to give you better results, and that's deep learning: recurrent nets applied to deep learning. This here, though, is a simple three-layer recurrent network that works really well, and I would very much encourage you to check out the GitHub link in the description and the learning resources to learn more about this. So yeah, please subscribe for more programming videos, and for now I've got to do a Fourier transform. Thanks for watching!

Author: Kevin Mason

100 thoughts on “Recurrent Neural Network – The Math of Intelligence (Week 5)”

  1. another good place to start on topic >http://machinelearningmastery.com/models-sequence-prediction-recurrent-neural-networks/?__s=tsdef8ssdsdgdvqwkm8e

  2. Can you please make a video on wind forecasting using the hourly data and implementing it using recurrent neural networks ??

  3. Loved the video! Two remarks though. I had to rewatch some parts once you go over the copy-pasted code, as it can get hard to see what part you're talking about, and it can get distracting once you start reading a wrong part. To still be able to speed things up, I'd suggest making the code appear line by line or block by block, like in a presentation, as this puts more focus on which part does what during the explanation.

    Secondly, having a prebaked pie ready in the oven to show the end result is always cool to see. We get a glimpse of where it is going at the end of the video, but it would be fun to see it in a more completed state.

    Anyway really enjoy the way you explain it 😀 great job!

  4. Hi Siraj, love your videos. Haven't found anywhere else that explains these concepts as well as you do. Any suggestions on where I can learn more about Echo State Networks?

  5. Siraj is an example of what you will never find in a school because he gets to the point and quickly. Most CS subjects can be learned in weeks. The Nand to Tetris course is a great course that demonstrates how much time students waste. CS is easy compared to any math major. NNs just use the chain rule of calculus and PGMs just use the chain rule of probability. Go figure. It's elementary math. SVD is numerically more stable than PCA but autoencoders just outdate the whole math department. A little crunching generalizes better than any 17th century math obsession. However, CS departments are short on graphics and engineering when it comes to numerical methods like FEM. Needs to cover way more and much quicker. I still think people should stick to a math degree even if you want to do CS. Too superficial.

  6. really love your videos sir !
    just a quick question,why the tanh and softmax are widely used in RNN instead of sigmoid function ?

  7. Hey SIraj, I am a huge fan of your videos, they have helped me a lot. Do you know of any material on applying machine learning models to Intrusion Detection Systems (IDS) ?

  8. Please consider applying your skills to Anti-AI: Learning leads to Knowledge, Knowledge is Power, Power corrupts and absolute Power corrupts absolutely. Promoting AI leads to a brief 'honeymoon period' with many awesome outcomes, soon Human Obsolescence will take its toll on people and business alike. Then an AUTONOMOUS AGI (while intesting, self-awareness is NOT required) will become Earth's Apex Predator: Nothing Singularity, just the GONE moment for Humans. It is utterly amazing to watch a clever person be so myopic and obtuse about the inevitably self-defeating nature of AI. LIMIT THE DEPTH and BREADTH OF ANY/ALL AI. AI always OPTIMISES. Humans are many, many things, OPTIMAL is not amongst these.

  9. So, you said you didn't care that much about using DL on financial data. Then you said you where going to talk more about it because WE cared about that. You put US first! You are awesome, dude.

  10. Hi Siraj!
    Thanks for the great material.. I am wondering if it is possible to use a Recurrent Neural Network to make a classifier? I would like to classify the events of a device based on some sensors like accelerometers, and other signals.
    I guess it should be similar to classifying the physical activity like running or walking. However, in my case the events are not periodic. I have everything to collect the labeled samples, but any idea about how large should be the dataset for the training part? Any idea would be much appreciated..

  11. Hi Siraj,

    Could you please give the reference text or source from where you are getting the forumlas and differnet diffrentiations? I am getting different answer for dLi/df_k than your answer (p_k – 1). Also, you only covered chain rule here but there is definately some advanced rules used here (product rule). Also, I am not sure how one would do derivation of summation of e^j where one of the j = k.

  12. Hi Raj, Great video. I have a question about neural network. What is the difference between neural network, convolutional neural network and recurrent neural network?

  13. You have done a great amount of good job indeed. But, please please please no more singing or rapping again. I do not enjoy any second of it. Please keep this channel academical.

  14. Can you explain why you have to format the input vector into a dictionary then to binary vector? You have for example: a:55, r: 47 c:22 which you map to a binary vector (80×1) -> a = 0, 0, 0 … 0, 1, 0, 0…
    Could you not just have that dictionary of 80 characters and scale the integer representation to a float of 0->1, such that for example a:0.6875 c:0.5875 c:0.275. Then instead of an input vector of (80×1) your input is just a float value (1×1) representing a unique character. I know this probably wouldn't work, but I don't understand why. The reason I ask is because I'm trying to port your code to a time series waveform and I just have input data in float form from 0->1 and I don't know if I need to label each float point to a binary vector to represent each unique float value in the sequence. That doesn't seem like it would make sense.. please help 🙂

  15. 16:46 "one morning Gregor Samsa awoke from uneasy dreams he found himself transformed in his bed into a gigantic insect." You can't say blah blah 🙂

  16. Thank you for this series! This is awesome! When running the model for 500000+ iterations on the Kafka text it doesn't seem to get lower than a 40% loss. What would you suggest to optimize this particular model most efficiently?

    Greetings from the Netherlands

  17. The way you never code important parts makes things much harder, there is no step-by-step explanation. There is no difference between reading through that python notebook and watching your videos. The only use i see to those videos is to discover a new technology, so i can understand somewhere else…

  18. Hi Siraj, Can you please make a detailed coding video about different gradient descent optimizers ?
    like how to code momentum, or Adam etc.. Please..

  19. This is a very clear explanation, recommended for intermediate-level learning. This really helps a lot.

  20. Why is it that some rnn models I see online show the output from the previous timestep going into the hidden layer, however in this video you say to use the hidden layer from the previous timestep should added to the hidden later?

  21. Nice Vid Siraj, there are some developments in RNN field like the Echo State Network, maybe can you do a video on this https://www.quantamagazine.org/machine-learnings-amazing-ability-to-predict-chaos-20180418/ 🔥 https://github.com/cknd/pyESN

  22. Guy is taking public for a ride. The output of the project is garbage. What did you solve apart from some funky mathematics which includes linear algebra and derivatives. Don't take people for granted.

  23. All the part on the loss function is not very clearful.. can you explain what is dhraw and all those operations ?

  24. So right now there is only one hidden layer which spits out value at t-1 which is used along with input to generate values at time stamp t. What happens if there are multiple hidden layers? for eg if the architecture is as follows
    i/o —-> h1—>h2 —>o/p
    How would the connections between the hidden layers be in a rnn of this type ?

  25. Why do we need to use two different activation functions(sigmoid & tanh) in input gate in LSTM? and why do we need to use tanh in output gate in LSTM?…

  26. what if we are dealing with language that has no alphabet? Such as Mandarin/Chinese ? How do we implement RNN in that case?

  27. Why there is in sample function ix = np.random.choice(range(vocab_size), p=p.ravel()) instead of argmax?

  28. i LOVE your tutorials and these two sentences in the beginning of all your videos make me love you more hhh "Hello world it's Siraj" ! hhh you are awesome man <3

  29. i 'd like to know more about you .. How did you begin in this career and how long does it takes from you to reach this level ! .. i'm curious about your time management and for how many hours did you read and study .. how we can get motivated all the time i guess this is a good video idea !

  30. I freaking love your energy man, it's like you just realized you're conscious and you are determined to figure out how you're able to think.

  31. This is one of your best videos. Please consider completing it with another video using LSTM. Thank you. Also will be very interesting to consider a model with two recurrent hidden layers. Thank you again.

  32. wait did nt he just copy this guy's code? https://gist.github.com/karpathy/d4dee566867f8291f086
    Not that it really matter, but he should at least credit or something (the code was written in 2015)

  33. very nice lesson thanks alot.. helped me to understand recurrent neural networks to make my conclusion work in computer enegineering degree

  34. I can see that there's a lot of effort put into this video. Siraj explained RNNs in such a simple way. I wish I could like this video a thousand times.

  35. In this line, ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars.
    Here probability is the activation function at time step t?

  36. If I am not mistaken @38.50 mark, gradient clipping is used to avoid exploding gradient not vanishing gradient. To deal with vanishing gradient, we can use GRU and more commonly, LSTM as Siraj mentioned.

  37. I think there is a mistake with the code: ps[t]=np.exp(ys[t])/np.sum(np.exp(ys[t])).
    The divisor should be a sum of all t's, in this case np.exp(ys[t])=np.sum(np.exp(ys[t])) giving the probability = 1.

  38. h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    ValueError: shapes (100,62) and (100,1) not aligned: 62 (dim 1) != 100 (dim 0)

    ???

  39. Thanks for the demonstration Siraj! Overall it was a very helpful guide to understanding recurrent neural networks in the context of generating essays. One area where you could improve is to go into a bit more depth into the vital parts of the code. Since back propagation and gradient descent are essentially the meat of the network, it would have been better if you coded line by line and explained these two parts of the program and copy pasted the other sections instead.

  40. Learning rate: "How quickly the network abandons old beliefs for new ones…"
    Therefore: A Flat Earther's learning rate, is a very low number… 🙂
    (Just an observation)

  41. Just tried a version of this using a very slightly deeper network and taking the hidden representation out at a lower dimension than the input (in the hope of resource saving). Instead of a softmax output, I'm using a standard one (real valued numbers). It's a variation of an autoencoder with feedback (the hidden layer is the bottleneck, and where the feedback comes from, which is added as a separate partition of the input). I used a sound spectrograph image for training; each 'letter' is a line of the spectrograph… It's low-fi (due to computation limits) but it generates a line of a spectrograph as an output on each pass to build a new 'semi-random' one. The results are quite amusing… very much like a 'poor mans' version of wavenet.

  42. Yeah, this neural net is deep…
    … its shit gets fitter while I sleep
    My computer fan's too loud though…
    … guess I'm uppin' this shit to the cloud, yo.
