[MUSIC] Stanford University. It’s getting real today. So, let’s talk about a little

bit of the overview today. So, we’ll really get you into

the background for classification. And then, we’ll do some interesting things

with updating these word vectors that we so far have learned in

an unsupervised way. We’ll update them with some real

supervision signals such as sentiment and other things. Then, we’ll look at the first real

model that is actually useful and you might wanna use in practice. Well, other than, of course, the word

vectors, but one sort of downstream task which is window classification and we’ll

really also clear up some of the confusion around the cross entropy error and

how it connects with the softmax. And then, we’ll introduce the famous

neural network, our most basic LEGO block that we may start to call deep to get

to the actual title of this class. Deep learning in NLP.

And then, we’ll actually introduce another loss

function, the max margin loss and take our first steps into

the direction of backprop. So, this lecture will be,

I think very helpful for problem set one. We’ll go into a lot of the math

that you’ll need probably for number two in the problem set. So, I hope it’ll be very useful and

I’m excited for you cuz at the end of this lecture, you’ll feel hopefully a lot

better about the magic of deep learning. All right, are there any

organizational questions around problem sets or

programming sessions with the TAs? No, we’re all good? Awesome, thanks to the TAs for

clearing up everything. Cool, so let’s be very careful about

our notation today because that is one of the main things that

a lot of people trip up over as we go through very complex

chain-rules and so on. So, let’s start at the beginning and

say, all right, we have usually a training dataset

of some input X and some output Y. X could be in the simplest case, words

in isolation, just a single word vector. It’s not something you would

usually do in practice. But it’ll be easy for

us to learn that way. So we’ll start with that but then,

we’ll move to context windows today. And then eventually, we’ll use the same

basic building blocks that we introduce today for sentences and documents and

then complex interactions for everything. Now, the output in the simplest

case it’s just a single label. It’s just a positive or

a negative kind of sentence. It could be the named entities of

certain words in their context. It can also be other words, so

in machine translation, for instance, you might wanna output eventually

a sequence of other words as our yi and we’ll get to that in a couple weeks. And, yeah, basically they have multiword

sequences as potential outputs. All right, so what is the intuition for

classification? In the standard machine learning case,

so not yet the deep learning world, we usually just, for

something as simple as logistic regression, basically want to define and learn

a simple decision boundary where we say everything to the left of this or

in one direction is in one class and the other one,

all the other things in the other class. And so, in general machine learning,

we assume our inputs, the Xs are kinda fixed,

they’re just set and we’ll only train the W parameter,

which is our softmax weights. So, we’ll compute the probability of Y,

given the input X with this kind of input. And so, one notational comment here is for the whole dataset,

we often subscript with i but then, when I drop the i we’re just looking

at a single example of x and y. Eventually, we’re going to overload

the subscript a little bit and look at the indices of certain vectors, so

if you get confused, just raise your hand and ask. I’ll try to make it clear

which one is which. Now, let’s dive into the softmax. We mentioned it before but we wanna really

carefully define and recall the notation here cuz we’ll go and take derivatives

with respect to all of these parameters. So, we can tease apart two steps here for

computing this probability of y given x. The first thing is, we’ll take the y’th

row of W and multiply that row with x. And so again this notation here,

when we have Wy. And that means we’ll have,

we’re taking the y’th row of this matrix. And then, multiplying it here with x. Now if we do that multiple times for

all c from one to our classes. So let’s say, this is 1, 2, 3,

the 4th row and multiply each of these. So then we get four numbers here. And these are unnormalized scores. And then, we’ll basically,

pipe this vector through the softmax to compute a probability

distribution that sums to one. All right, that’s our step one. Any questions around that? Cuz it’s just gonna keep

on going from here. All right, great. And, I get that sometimes

in general from previous sort of surveys, it seems to be that

15% of the class are usually bored when we go through all of these,

like all of these derivatives. 15% are super overwhelmed and then the

majority of people are like, okay, it’s a good speed, I’m learning something, I’m

getting it, and you’re making progress. So, sorry for the 30% for

whom this is too slow or too fast. You can probably just skim

through the lecture slides or speed it up if you’re watching online, if you’re super familiar with taking

super complex derivatives. And if it’s a little overwhelming, then

definitely come to all the office hours. We have an awesome set of

TAs who will help you. All right, now we,
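
The two softmax steps described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the sizes, the random values, and the names `softmax`, `W`, `x`, `f`, and `p` are made up here, and the max-subtraction is a common stability trick rather than something stated in the lecture.

```python
import numpy as np

def softmax(f):
    """Map a vector of unnormalized scores to a probability distribution."""
    # Shifting by the max does not change the result; it only avoids overflow
    # in exp (a common trick, not something stated in the lecture).
    e = np.exp(f - np.max(f))
    return e / e.sum()

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
C, d = 4, 3                      # hypothetical: 4 classes, 3-dimensional input
W = rng.normal(size=(C, d))      # softmax weights, one row per class
x = rng.normal(size=d)           # a single input vector

f = W @ x        # step 1: f[c] is row c of W times x, an unnormalized score
p = softmax(f)   # step 2: normalize the scores into probabilities
```

Here `f[c]` is the c-th row of `W` times `x`, and `p` is the distribution that sums to one.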

let’s look at a single example of an x and y that we wanna predict. In general, we want our model to

essentially maximize the probability of the correct class. We want it to output the right class at the

end by taking the argmax of that output. And maximizing probability is the same

as maximizing log probability, it’s the same as minimizing the negative

of that log probability and that is often our objective function. So, why do we call this

the cross-entropy error? Well, we can define the cross-entropy

in the abstract in general as follows. So let’s assume we have

the ground truth or gold or target probability distribution,

we use those three terms interchangeably. Basically, what the ideal target

in our training dataset, the y and we’ll assume that, that is one at

the right class and zero everywhere else. So if we have for instance, five

classes here and it’s the center class. It’s the third class, and this would be one

and all the other numbers would be zero. So, if we define this as p

and our computed probability, that our softmax outputs, as q, then we

would define the cross-entropy here as basically this sum

over all the classes. And in our case, p here is just

one-hot vector that’s really only 1 in one location and 0 everywhere else. So, all these other terms

are basically gone. And we end up with just the negative log of q, and that’s exactly the negative log of what

our softmax outputs, all right? And then, there are some nice connections

to Kullback-Leibler divergence and so on. I used to talk about it but

we don’t have that much time today. But if you’re

familiar with this from stats, you can see this as trying to minimize the

Kullback-Leibler divergence between these two distributions. But really, this is all you need to

know for the purpose of this class. So this is for

one element of your training data set. Now, of course, in general,
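
As a small sketch of why the sum collapses for a one-hot target: with the cross-entropy written as H(p, q) = -sum over c of p(c) log q(c), only the term at the correct class survives. The numbers in `q` below are invented for illustration, and `cross_entropy` is just a helper name chosen here.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = - sum over classes c of p(c) * log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.1, 0.2, 0.6, 0.05, 0.05])   # made-up softmax output, 5 classes
p = np.zeros(5)
p[2] = 1.0                                  # one-hot target: the third class

loss = cross_entropy(p, q)
# Every term with p(c) = 0 vanishes, so the loss equals -log q at the third class.
print(loss, -np.log(q[2]))
```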

you have lots of training examples. So we have our overall objective

function we often denote with J, over all our parameters theta. And we basically sum these negative log

probabilities of the correct classes that we index here, a sub-index with yi. And basically we want to

minimize this whole sum. So that’s our cross-entropy error

that we’re trying to minimize, and we’ll take lots of derivatives of it over

the next couple of hours. All right, any questions so far? So this is the general ML case where

we assume our inputs here are fixed. Yes, it’s a single number. So we are not multiplying a vector here,

so p(c) is the probability for that class, so that’s one single number. Great question. So the cross entropy, a single number,

our main objective that we’re trying to minimize, or

our error that we’re trying to minimize. Now, whenever you write

this F subscript Y here, we don’t want to forget that F is really

also a function of X, our inputs, right? It’s sort of an intermediate step and

it’s very important for us to play around with this notation. So we can also rewrite this as W y,

that row, times x, and

we can write out that whole sum. And that can often be helpful as you are

trying to take derivatives of one element at a time to eventually see the bigger

picture of the whole matrix notation. All right, so often we’ll write f here

in terms of this matrix notation. So this is our f, this is our W,

and this is our x. So just standard matrix

multiplication with a vector. All right, now most of the time we’ll

just talk about this first part of the objective function but

it’s a bit of a simplification because in all your real applications you will

also have this regularization term here. As part of your overall

objective function. And in many cases,

this theta here for instance, if it’s the W matrix of our

standard logistic regression, we’ll essentially just try this

part of the objective function. We’ll try to encourage the model to keep

all the weights as small as possible and as close as possible to zero. You can kind of assume if you want as

a Bayesian that you can have a prior, a Gaussian distributed prior that says

ideally all these are small numbers. Often times if you don’t have

this regularization term your numbers will blow up and

it will start to overfit more and more. And in fact, this kind of plot is

something that you will very often see in your projects and

even in the problem sets. And when I took my very first statistical

learning class, the professor said, this is the number one plot to remember. So, I don’t know if it’s that important,

but it is very, very important for all our applications. And it’s basically a pretty abstract plot. You can think of the x-axis as

a variety of different things. For instance, how powerful your model is. How many deep layers you’ll have or

how many parameters you’ll have. Or how many dimensions

each word vector has. Or how long you trained a model for. You’ll see the same kind of pattern

with a lot of different x-axes, and then the y-axis here is

essentially your error. Or your objective function that you’re

trying to optimize and minimize. And what you often observe is,

the more powerful your model gets, the better you are on

lowering your training error, the better you can fit these x-i,

y-i pairs. But at some point you’ll actually start

to over-fit, and then your test error, or your validation or

development set error, will go up again. We’ll go into a little bit more details

on how to avoid all of that throughout this course and

in the project advice and so on. But this is a pretty fundamental thing and

just keep in mind that for a lot of the implementations, and your projects you

will want this regularization parameter. But really it’s the same one for

almost all the objective functions so we’re going to chop it and mostly

focus on actually fitting our dataset. All right,
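
To make the full objective concrete, here is a minimal sketch of the data term plus a regularization term. Several things are assumptions: the penalty is taken to be a squared (L2) penalty, which matches the Gaussian-prior remark but is not spelled out in the transcript; the data term is averaged rather than summed; inputs are stored as rows; and `objective` and `lam` are names invented here.

```python
import numpy as np

def objective(W, X, y, lam):
    """Average cross-entropy over the dataset plus an L2 penalty on W.

    W: (C, d) softmax weights, X: (N, d) inputs stored as rows,
    y: (N,) integer class labels, lam: regularization strength (assumed name).
    """
    F = X @ W.T                               # (N, C) unnormalized scores
    F = F - F.max(axis=1, keepdims=True)      # stability shift, result unchanged
    log_probs = F - np.log(np.exp(F).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(y)), y].mean()
    reg_loss = lam * np.sum(W ** 2)           # pushes every weight toward zero
    return data_loss + reg_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                   # made-up inputs
y = np.array([0, 1, 2, 0, 1, 2])              # made-up labels
W = rng.normal(size=(3, 3))

print(objective(W, X, y, lam=0.0))   # data term only
print(objective(W, X, y, lam=0.1))   # data term plus penalty
```

The penalty only depends on the weights, which is why it can be dropped from the derivations and added back at the end.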

any questions around regularization? So basically, you can think of

this in terms of if you really care about one specific number,

then you can adjust all your parameters such that it will exactly

go to those different points. And if you force it to not do that,

it will kind of be a little smoother. And be less likely to fit

exactly those points and hence often generalize slightly better. And we’ll go through a couple of examples

of what this will look like soon. All right, now as I mentioned

in general machine learning, we’ll only optimize the W here,

the parameters of our Softmax classifier. And hence our updates and

gradients will only be pretty small, so in many cases we only have you

know a handful of classes, and maybe our word vectors are a hundred dimensional, so if

we have three classes and 100 dimensional word vectors we’re trying to classify,

we’d only have 300 parameters. Now, in deep learning,

we have these amazing word vectors. And we actually will want to

learn not just the Softmax but also the word vectors. We can back propagate into them and

we’ll talk about how to do that today. Hint, it’s going to be taking derivatives. But the problem is when we update

word vectors, conceptually as you are thinking through this, you

have to realize this is very, very large. And now all of a sudden you have a very

large set of parameters, right? Let’s say your word vectors

are 300 dimensional and you have, you know, 10,000 words in your vocabulary. All of a sudden you have an immensely

large set of parameters so on this kind of plot you’re going

to be very likely to overfit. And so before we dive into all this

optimization, I want you to get a little bit of an intuition of what

it means to update word vectors. So let’s go through a very simple example where we might want to

classify single words. Again, it’s not something

we’ll do very often, but let’s say you want to classify single

words as positive or negative. And let’s say in our training data set we

have the word TV and telly and say you know this is movie reviews and if you

say this movie is better suited for TV. It’s not a very positive thing to say

about a movie that’s just coming out into movie theaters. And so we would assume that

in the beginning telly, TV, and television are actually all

close by in the vector space. We learn something with word2vec or

GloVe vectors, and we trained these word vectors on a very, very large corpus and

it learned all these three words appear often in a similar context, so

they are close by in the vector space. And now we’re going to train but,

our smaller sentiment data set only includes in the training set, the X-i

Y-i as TV and telly and not television. So now what happens as we

train these word vectors? Well, they will start to move around. We’ll project sentiment into them and

so you now might see telly and TV (that’s a British dataset) start to

move somewhere else into the vector space. But television actually stays

where it was in the beginning. And now when we want to test it, we would actually now misclassify this

word because it’s never been moved. And so what does that mean? The take home message here will be that if you have only a very

small training dataset. That will allow you especially with these

deep models to overfit very quickly, you do not want to train

your word vectors. You want to keep them fixed,

you pre-trained them with nice GloVe or word2vec models on a very large corpus, or you just downloaded them from the GloVe

website, and you want to keep them fixed, cuz otherwise you will

not generalize as well. However, if you have a very large dataset

it may be better to train them in a way we’re going to describe in

the next couple of slides. So, an example for

where you do that is, for instance, machine translation where you might have

many hundreds of Megabytes or Gigabytes of training data and you don’t really need to

do much with the word vectors other than initialize them randomly, and then train

them as part of your overall objective. All right, any questions around generalization

capabilities of word vectors? All right, it might still be

magical how we’re training this, so that’s what we’re gonna describe now. So, we rarely ever really

classify single words. Really what we wanna do is

classify words in their context. And there are a lot of fun and

interesting issues that arise in context. Really,

that’s where language begins and grammar and

the connection to meaning and so on. So here, a couple of fun examples of

where context is really necessary. So for instance, we have some words

that are actually auto-antonyms, so they mean their own opposite. So for instance to sanction can

mean to permit or to punish. And it really depends on the context for

you to understand which one is meant, or to seed can mean to place seeds or

to remove seeds. So without the context, we wouldn’t really

understand the meaning of these words. And in one of the examples that you’ll see

a lot, which is named entity recognition, let’s say we wanna find locations or

people names, we wanna identify: is this a location or

not. You may also have things like Paris, which

could be Paris in France or Paris Hilton. And you might have Paris

staying in Paris and you still wanna understand

which one is which. Or if you wanna use deep learning for

financial trading and you see Hathaway, you wanna make sure that if it’s just a

positive movie review from Anne Hathaway, you’re not all of a sudden buying

stocks from Berkshire Hathaway, right? And so,

there are a lot of issues that are fun and interesting and

complex that arise in context. And so, let’s now carefully walk

through this first useful model, which is Window classification. So, we’ll use as our first motivating

example here 4-class named entity recognition, where we basically

wanna identify a person or location or organization or none of the above for

every single word in a large corpus. And there are lots of different

possibilities that exist. But we’ll basically look

at the following model. Which is actually quite

a reasonable model. And also one that started in 2008. So the first beginning by Collobert and

Weston, a great paper, to do the first kind of useful state

of the art text classification and word classification in context. So, what we wanna do is basically train a

softmax classifier by assigning a label to the center word and then concatenating all

the words in a window around that word. So, let’s take for example this

subphrase here from a longer sentence. We basically wanna classify

the center word here which is Paris, in the context of this window. And we’ll define the window length as 2. 2 being 2 words to the left and 2 words to the right of the current center

word that we’re trying to classify. All right, so what we will do

is we’ll define our new x for this whole window as the concatenation

of these five word vectors. And just in general throughout all of this lecture all my

vectors are going to be column vectors. Sadly in number two of the problem set,

they’re row vectors. Sorry for that. Eventually, all these programming

frameworks they’re actually row-wise first and so it’s faster in the low-level

optimization to use row vectors. For a lot of the math, I actually find

it simpler to think of them as column vectors. We’re very clear about it in the problem set, but

don’t get tripped up on that. So basically, we’ll define this here as

one 5d-dimensional column vector. So, we have d-dimensional word vectors,

we have five of them and we stack them up in one column, all right. Now, the simplest window classifier that

we could think of is to now just put the softmax on top of this

concatenation of five word vectors and we’ll define this, our x here. Our inputs is just the x of the entire

window for this concatenation. And we have the softmax on top of that. And so, this is the same

notation that we used before. We’re introducing here y hat,

with sadly the subscript y for the correct current class. It’s tough, I went through [LAUGH] several

iterations, it’s tough to have like perfect notation that works

through the entire lecture always. But you’ll see why soon. So, our overall objective here is,

again, this whole sum over all these probabilities that we have,

or negative log of those. So now, the question is, how do we

update these word vectors x here? One x is a window, and

x is now deep inside the softmax. All right, well, the short answer

is we’ll take a lot of derivatives. But the long answer is, you’re gonna have

to do that a lot in problem set one and maybe in the midterm. So, let’s be a little more helpful, and

actually go through some of the steps and give you some hints. So some of this, you’ll actually

have to do in your problem set, so I’m not gonna go through all the details. But I’ll give you a couple of hints

along the way and then you can know if you’re hitting those and then you’ll

see if you’re on the right track. So, step one, always very

carefully define your variables, their dimensionality and everything. So, y hat will define as the softmax

probability of the vector. So, the normalized scores or

the probabilities for all the different classes that we have. So, in our case we have four. Then we have the target distribution. Again, that will be a one hot

vector where it’s all zeroes except at the ground truth index of the class y,

where it’s one. And we’ll define our f

here as f of x again, which is this matrix multiplication. Which is going to be a C dimensional

vector where capital C is the number of classes that we have, all right. So, that was step one. Carefully define all of your variables and

keep track of their dimensionality. It’s very easy when you implement this and

you multiply two things, and they have wrong dimensionality, and

you can’t actually legally multiply them, you know you have a bug. And you can do this also

in a lot of your equations. You’d be surprised. In the midterm, you’re nervous. But maybe at the end you have some time. And you could totally grade it

by yourself in the first pass, by just making sure that all your

dimensionality of your matrix and vector multiplications are correct. All right, the second tip is the chain

rule, we went over this before, but I heard there’s a little bit of

confusion still in the office hours. So, let’s define this carefully for

a simple example and then we’ll go and give you a couple more hints also for

more complex example. So again, if you have something

very simple, such as a function y, which you can define here as f of u, and

u can be defined as g of x, so that the whole function, y of x,

can be described as f of g of x, then you would basically multiply dy/du

times du/dx. And so very concretely here,
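
A worked instance of that product, as a sketch. The inner function x cubed plus 7 and the factor 5 come from the lecture; the exponent 4 on the outer function is an assumption, added so that plugging u back in at the end is visible.

```latex
\begin{aligned}
y &= f(u) = 5u^{4}, \qquad u = g(x) = x^{3} + 7 \\
\frac{dy}{du} &= 20u^{3}, \qquad \frac{du}{dx} = 3x^{2} \\
\frac{dy}{dx} &= \frac{dy}{du}\,\frac{du}{dx}
  = 20\,(x^{3}+7)^{3} \cdot 3x^{2}
  = 60x^{2}\,(x^{3}+7)^{3}
\end{aligned}
```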

this is sort of high school level, but we’ll define it properly in

order to show the chain rule. So here,

you can basically define u as g(x), which is just the inside in

the parentheses here, so x cubed + 7. It can have y as a function of f(u), where we use 5 times u,

just replacing the inside definition here. So it’s very simple,

just replacing things. And now, we can take the derivative

with respect to u and we can take the derivative

of u with respect to x. And then we just multiply these two terms,

and we plug in u again. So in that sense, we all know,

in theory, the chain rule. But, now we’re gonna have the softmax, and we’re gonna have lots of matrices and

so on. So, we have to be very,

very careful about our notation. And we also have to be

careful about understanding, which parameters appear inside

what other higher level elements. So, f for instance is a function of x. So, if you’re trying to take

a derivative with respect to x, of this overall soft max you’re gonna have

to sum over all of the different classes inside which x appears. And you’ll see here,

this first application, not just of f_y (again, this is just

a subscript: the y-th element of the vector f, which is a function of x), but

also multiply it then here by this. So, when you write this out,

another tip that can be helpful for this softmax part of the derivative

is to actually think of two cases. One where c=y, the correct class, and one where it’s basically all

the other incorrect classes. And as you write this out,

you will observe and come up with something like this. So, don’t just write that down as your answer;

you have to put in your problem set the steps on how to get there. But, basically at some point you

observe this kinda pattern when you now try to look at all the derivatives

with respect to all the elements of f. And now,
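
One way to convince yourself of that pattern without doing the derivation is to check it numerically. The sketch below is an assumption-laden illustration: the score vector `f` and the class index `y` are made up, and the finite-difference check is a debugging device added here, not part of the lecture. It compares the two-case form (subtract 1 at the correct class, leave the others alone) against numerical derivatives, and against the same thing written with a one-hot vector `t`.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / e.sum()

def loss(f, y):
    # Negative log probability of the correct class y, given scores f.
    return -np.log(softmax(f)[y])

f = np.array([1.0, -0.5, 2.0, 0.3])   # made-up unnormalized scores
y = 2                                 # made-up index of the correct class
y_hat = softmax(f)

# Two-case form: y_hat[c] - 1 at the correct class, y_hat[c] everywhere else.
grad_cases = np.array([y_hat[c] - 1.0 if c == y else y_hat[c]
                       for c in range(len(f))])

# Central finite differences, one element of f at a time.
eps = 1e-6
grad_numeric = np.zeros_like(f)
for c in range(len(f)):
    step = np.zeros_like(f)
    step[c] = eps
    grad_numeric[c] = (loss(f + step, y) - loss(f - step, y)) / (2 * eps)

# The same gradient without any if statement, using a one-hot target t.
t = np.zeros_like(f)
t[y] = 1.0
delta = y_hat - t

print(np.max(np.abs(grad_cases - grad_numeric)))   # should be tiny
```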

when you have this you realize, okay, at the correct class we’re

actually subtracting one here, and all the incorrect classes,

you will not do anything. Now, the problem is when

you implement this, it kind of looks like

a bunch of if statements. If y equals the correct class for my training set, then, subtract 1,

that’s not gonna be very efficient. Also, you’re gonna go insane if you try

to actually write down equations for more complex neural network

architectures ever. And so, instead, what we wanna do is

always try to vectorize a lot of our notation, as well as our implementation. And so, what this means here,

in this case, is you can actually observe that,

well, this 1 is exactly where t, our one-hot target distribution,

also happens to be 1. And so, what you’re gonna wanna do,

is basically describe this as y(hat)- t, so

it’s the same thing as this. And don’t worry if you don’t

understand how we got there, cuz that’s part of your problem set. You have to, at some point, see this equation while you’re

taking those derivatives. And now, the very first baby step towards

back-propagation is actually to define this term, in terms of a simpler single

variable and we’ll call this delta. We’ll get good, we’ll become good friends

with deltas because they are sort of our error signals. Now, the last couple of tips. Tip number six. When you start with this chain rule, you

might want to sometimes use explicit sums first, and

look at all the partial derivatives. And if you do that a couple of times

at some point you see a pattern, and then you try to think of how to

extrapolate from those patterns of single partial derivatives,

into vector and matrix notation. So, for example,

you’ll see something like this here, in at some point in your derivation. S,o the overall derivative with respect to

x of our overall objective function for one element, for one element from our

training set x and y is this sum. And it turns out when you

think about this for a while, you take here this row vector but

then you transpose it, and becomes an inner product, well if you

do that multiple times for all the C’s and you wanna get in the end a whole vector

out, it turns out you can actually just re-write the sum as W

transpose* the delta. So, this is one error signal here

that we got from our softmax, and we multiply the transpose of

our softmax weights with this. And again,

if some of these are not clear and you’re confused,

write them out as full sums, and then you’ll see that it’s really

just a rewrite of this in vector notation. All right, now what is the dimensionality

of the window vector gradient? So in the end, we have this derivative

of the overall cost here for one element of our training

set with respect to x. But x is a window. All right, so

say we have a window of five words. And each word is d-dimensional. Now, what should be the dimensionality

of this derivative of this gradient? That’s right,
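
The shape question can be checked directly in code. The following is a sketch under assumptions: the five-word window, the sizes, the class index, and the learning rate are all invented for illustration. It builds the window vector by concatenation, computes the error signal `delta` at the softmax, multiplies by the transpose of `W`, and then cuts the result back into one piece per word.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / e.sum()

rng = np.random.default_rng(0)
d, C = 4, 4                                   # hypothetical word dimension and class count
window = ["museums", "in", "Paris", "are", "amazing"]   # made-up window, center word "Paris"
word_vecs = {w: rng.normal(size=d) for w in window}

x = np.concatenate([word_vecs[w] for w in window])   # window vector, shape (5d,)
W = rng.normal(size=(C, 5 * d))                      # softmax weights over the window
y = 1                                                # made-up index of the correct class

t = np.zeros(C)
t[y] = 1.0
delta = softmax(W @ x) - t        # error signal at the softmax, shape (C,)
grad_x = W.T @ delta              # gradient with respect to the window, shape (5d,)
assert grad_x.shape == x.shape    # a parameter and its gradient must match

# Split the window gradient back into one d-dimensional piece per word.
per_word_grads = dict(zip(window, np.split(grad_x, len(window))))

lr = 0.1                          # learning rate, an arbitrary choice
updated_vecs = {w: word_vecs[w] - lr * per_word_grads[w] for w in window}
```

The `assert` on the shapes is the same sanity check recommended for the problem set: a gradient that does not have the shape of its parameter signals a bug.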

it’s five times the dimensionality. And that’s another really good way, and

one of the reasons we make you implement this from scratch, if you have any kinda

parameter, and you have a gradient for that parameter, and they’re not the same

dimensionality, you’ll also know you screwed up and there’s some mistake or

bug in either your code or your math. So, it’s a very simple debugging skill, and a way to check your own equations. So, the final derivative with respect

to this window is now this 5d-dimensional vector, because we had five d-dimensional

vectors that we concatenated. Now, of course the tricky bit is, you actually wanna update your word

vectors and not the whole window, right? The window is just this

intermediate step also. So really, what you wanna do is update and take derivatives with respect to each

of the elements of your word vectors. And so it turns out, very simply,

that can be done by just splitting that error, the gradient that you’ve got

overall for the whole window, and that’s just basically the concatenation of the

gradients of all the different word vectors. And those you can use to update your word

vectors, as you train the whole system. All right, any questions? Is there a mathematical what? Is there a mathematical notation for

the word vector t, other than it’s just variable t? Or that seems like a fine notation. You can see this as a probability

distribution, that is very peaked. >> Yeah.

>> That’s all, there’s nothing else to it. Just a single vector with all zeroes,

except in one location. >> So I’ll just write that down? >> You can write that up, yeah. You can always just write out and

it’s also something very important. You always wanna define everything, so

that you make sure that the TAs know that you’re thinking about the right thing,

as you’re writing out your derivatives, you write out the dimensionality,

you define them properly, you can use dot, dot,

dot if it’s a larger dimensional vector. You can just define t as your

target distribution [INAUDIBLE] >> The question is, do we still have two vectors for

each word? Great question, no. Essentially, when we did GloVe and

word2vec, and had these two u’s and v’s, for all subsequent lectures from now on,

we’ll just assume we have the sum of u and v and that’s our single vector x,

for each word. So, the question is does this gradient

appear in lots of other windows and it does. So, if you, the answer is yes. If you have the word “in,” that vector

here and the gradients will appear in all the windows that have

the word “in” inside of them. And same with museums and so on. And so as you do stochastic gradient

descent you look at one window at a time, you update it, then you go to the next

window, you update it and so on. Great questions. All right. Now, let’s look at how we update

these concatenated word vectors. So basically, as we’re training this,

if we train it for instance with sentiment we’ll push all

the positive words in one direction and the other words in the other direction. If we train it for

named entity recognition, then eventually our model can learn that seeing

something like “in” as the word just before the center word, would be indicative for

that center word to be a location. So now what’s missing for

training this full window model? Well mainly the gradient of J with

respect to the softmax weights W. And so
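
For readers who want something to check their own derivation against, here is a hedged sketch. The closed form used below, the outer product of the error signal `delta` with the input `x`, is the standard result for a softmax with cross-entropy loss; it is not stated in the lecture, which leaves the steps to the problem set. All sizes and values are made up, and the finite-difference comparison is only a debugging aid.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / e.sum()

def loss(W, x, y):
    # Negative log probability of the correct class for one example.
    return -np.log(softmax(W @ x)[y])

rng = np.random.default_rng(1)
C, D = 4, 10                      # hypothetical: 4 classes, window vector of length 10
W = rng.normal(size=(C, D))
x = rng.normal(size=D)
y = 2                             # made-up index of the correct class

t = np.zeros(C)
t[y] = 1.0
delta = softmax(W @ x) - t
grad_W = np.outer(delta, x)       # entry (i, j) is delta[i] * x[j]; same shape as W

# Compare against central finite differences, one entry of W at a time.
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(C):
    for j in range(D):
        step = np.zeros_like(W)
        step[i, j] = eps
        grad_numeric[i, j] = (loss(W + step, x, y) - loss(W - step, x, y)) / (2 * eps)

print(np.max(np.abs(grad_W - grad_numeric)))   # should be tiny
```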

we basically will take similar steps. We’ll write down all the partial

derivatives with respect to Wij first and so on. And then we have our full gradient for

this entire model. And again, this will be very sparse, and you’re gonna wanna have some clever ways

of implementing these word vector updates. So you don’t send a bunch of zeros

around at every single window, Cuz each window will

only have a few words. So in fact, it’s so important for

your code in the problem set to think carefully through your

matrix implementations, that it’s worth spending two or

three slides on this. So there are essentially two very

expensive operations in the softmax. The matrix multiplication and

the exponent. Actually later in the lecture, we’ll

find a way to deal with the exponent. But the matrix multiplication can also

be implemented much more efficiently. So you might be tempted in the beginning

to think this is probability for this class and

this is the probability for that class. And so I’ll implement a for

loop over all my different classes, and then I’ll take derivatives or

matrix multiplications one row at a time. And that is going to be very,

very inefficient. So let’s go through some very simple

Python code here to show you what I mean. So essentially,

instead of always looping over these word vectors, concatenating

everything into one large matrix and then multiplying these is

always going to be more efficient. So let’s assume we have 500

windows that we want to classify, and let’s assume each window

has a dimensionality of 300. These are reasonable numbers, and let’s assume we have five

classes in our softmax. And so at some point during

the computation, we now have two options. So W here are weights for the softmax. It’s gonna be C many rows and

d many columns. Now the word vectors here that

you concatenated for each window. We can either have the list of

a bunch of separate word vectors, or we can have one large matrix

that’s going to be d times n. So d many rows and n many windows. So we have 500 windows, so

we have 500 columns here in this 1 matrix. And now essentially, we can multiply

the W here for each vector separately, or we can do this one matrix

multiplication entirely. And you literally have

a 12x speed difference. And sadly with these larger models,

one iteration or something might take a day, eventually for

more complex models large data sets. So the difference is between

literally 12 days or 1 day of you iterating and

making your deadlines and everything. So it’s super important,

and now sometimes people are tripped up by what does it

mean to multiply and do this here. Essentially, it’s the same

thing that we’ve done here for one softmax, but

what we did is we actually concatenated. A lot of different input vectors x, and so we’ll get a lot of different

unnormalized scores out at the end. And then we can tease them apart again

afterwards. So you have here a C times d dimensional

matrix, for the d dimensional input, so using the same notation, d is the dimensionality of each window, times a d times

n matrix to get a C times n matrix. So these are all

the probabilities here for your N many training samples. Any questions around that? So it’s super important, all your code

will be way too slow if you don’t do this. And so
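
As a rough sketch of the shapes just described, assuming NumPy and random stand-in data (the names W, X, scores and probs are illustrative and not taken from the lecture code), one matrix product yields the unnormalized scores for all windows, and each column can be teased apart again afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 500, 300, 5              # windows, window dimensionality, classes

W = rng.standard_normal((C, d))    # softmax weights: C rows, d columns
X = rng.standard_normal((d, n))    # one column per concatenated window

# One matrix product gives the unnormalized scores for every window at once.
scores = W @ X                     # shape (C, n)

# Column-wise softmax; subtracting the column max is only for numerical stability.
shifted = scores - scores.max(axis=0, keepdims=True)
exp_scores = np.exp(shifted)
probs = exp_scores / exp_scores.sum(axis=0, keepdims=True)

# Teasing the result apart: column i equals the product for window i alone.
i = 7
assert np.allclose(scores[:, i], W @ X[:, i])
```

Each column of probs then sums to one, so column i holds the class probabilities for window i.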

this is very much an implementation trick. And so in most of the equations, we’re not gonna actually go there cuz

that makes everything more complicated. And the equations look at only

a singular example at a time, but in the end you’re gonna wanna

vectorize all your code. Yeah, matrices are your friend,

use them as much as you can. Also in many cases, especially for

this problem set where you really understand the nuts and bolts of how

to train and optimize your models. You will come across a lot

of different choices. It’s like,

I could implement it this way or that way. And you can go to your TA and ask,

should I implement this way or that way? But you can also just use %timeit,

the IPython magic command, and just make a very informed decision and

gain intuition yourself. And just basically wanna

speed test a lot of different options that you have in

your code a lot of the time. All right, so
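
A minimal sketch of such a speed test, assuming NumPy and the standard-library timeit module in place of the %timeit magic (the function names loop_version and matrix_version are made up for illustration, and the measured ratio will depend on your machine):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 500, 300, 5
W = rng.standard_normal((C, d))

# The same data twice: a list of n separate column vectors, and one d x n matrix.
window_list = [rng.standard_normal((d, 1)) for _ in range(n)]
window_matrix = np.hstack(window_list)

def loop_version():
    # One small matrix-vector product per window.
    return [W @ x for x in window_list]

def matrix_version():
    # A single matrix-matrix product for all windows.
    return W @ window_matrix

t_loop = timeit.timeit(loop_version, number=200)
t_matrix = timeit.timeit(matrix_version, number=200)
print(f"loop: {t_loop:.4f}s  matrix: {t_matrix:.4f}s  ratio: {t_loop / t_matrix:.1f}x")
```

Both versions compute the same numbers; only the time they take differs.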

this was just a pure softmax, and now the softmax alone

is not very powerful. Because it really only gives us these

linear decision boundaries in your original space. If you have very, very little

training data that could be okay, and you kind of used a not so powerful model

almost as an abstract regularizer. But with more data,

it’s actually quite limiting. So if we have here a bunch of words and

we don’t wanna update our word vectors, softmax would only give us this linear

decision boundary which is kind of lame. And it would be way better if we could correctly classify these

points here as well. And so basically, this is one of the many

motivations for using neural networks. Cuz neural networks will give us much

more complex decision boundaries and allow us to fit much more complex

functions to our training data. And you could be snarky and actually rename neural networks,

which sounds really cool, to just general function approximators. It just wouldn’t have quite the same ring to

it, but it’s essentially what they are. So let’s define how we get from

the simple logistic regression to a neural network and beyond,

and deep neural nets. So let’s demystify the whole

thing by starting, defining again some of the terminology. And we can have more fun with the math,

and then one and a half lectures from now. We can just basically use

all of these Lego blocks. So bear with me,

this is going to be tough. And try to concentrate and

ask questions if you have any, cuz we’ll keep building now a pretty

awesome large model that’s really useful. So we’ll have inputs, we’ll have

a bias unit, we’ll have an activation function and output for each single

neuron in our larger neural network. So let’s define a single neuron first. Basically, you can see it as

a binary logistic regression unit. We’re going to have inside, again, a set of weights that we

take an inner product of with our input. So we have the input x

here to this neuron. And in the end,

we’re going to add a bias term. So we have an always on feature, and that kind of defines how likely

should this neuron fire. And by firing, I mean have a very

high probability that’s close to one. For being on. And f here is always, from now on,

going to be this element wise function. In our case here the sigmoid that just

squashes whatever this sum gives us, the inner product plus the bias term, and basically

just squashes it to be between 0 and 1. All right, so this is the definition

of the single neuron. Now if we feed a vector of inputs through

all this different little logistic regression functions and

neurons, we get this output. And now the main difference between

just predicting directly a softmax and standard machine learning and deep learning is that we’ll actually not

force this to give directly the output. But they will themselves be inputs to yet

another neuron. And it’s a loss function on top of that

neuron such as cross entropy that will now govern what these

intermediate hidden neurons. Or in the hidden layer what they

will actually try to achieve. And the model can decide itself

what it should represent, how it should transform this input

inside these hidden units here in order to give us a lower

error at the final output. And it’s really just this

concatenation of these hidden neurons, these little binary

logistic regression units that will allow us to build very

deep neural network architectures. Now again, for sanity’s sake, we’re

going to have to use matrix notation cuz all of this can be very simply described

in terms of matrix multiplication. So a1 here is going to be the final activation of the first neuron,

a2 in second neuron and so on. So instead of writing out the inner

product here, or writing even this as an inner product plus the bias term

we’re going to use matrix notation. And it’s very important now to pay

attention to this intermediate variables that we’ll define because

we’ll see these over and over again as we use a chain

rule to take derivatives. So we’ll define z here as W

times x plus the bias vector. So we’ll basically have here as

many bias terms and this vector has the same dimensionality as the number

of neurons that we have in this layer. And W will have number of rows for

the number of neurons that we have times number of columns for

the input dimensionality of x. And then, whenever we write a = f(z), what that means here is that we’ll

actually apply f element wise. So f(z) when z is a vector is just f(z1),

f(z2) and f(z3). And now you might ask, well, why do we

have all this added complexity here with this sigmoid function. Later on we can actually have other

kinds of so called non linearities. This f function and

it turns out that if we don’t have the non-linearities in between and

we will just stack a couple of this linear layers together it wouldn’t

add a very different function. In fact it would be continuing to

just be a single linear function. And intuitively as you

have more hidden neurons, you can fit more and

more complex functions. So this is like a decision boundary

in a three dimensional space, you can think of it also in

terms of simple regression. If you had just a single hidden neuron, you kinda see here almost

an inverted sigmoid. If you have three hidden neurons,

you could fit this kind of more complex functions and with ten neurons,

each neuron can start to essentially, over fit and try to be very good

at fitting exactly one point. All right, now let’s revisit our

single window classifier and instead of slapping a softmax directly

onto the word vectors we’re now going to have an intermediate hidden layer

between the word vectors and the output. And that’s when we really start to

gain an accuracy and expressive power. So let’s define a single

layer neural network. We have our input x that will be again, our window, the concatenation

of multiple word vectors. We’ll define z, and we’ll define a as

the element-wise nonlinearity applied to z. And now, we can use this

neural activation vector a as input to our final classification layer. The default that we’ve had so

far was the softmax, but let’s not rederive the softmax. We’ve done it multiple times now,

you’ll do it again in a problem set, and introduce an even simpler one and walk through all the gory details

of that simple classifier. And that will be a simple,

unnormalized score. And in this case here, this will

essentially be the right mechanism for very simple binary

classification problems, where you don’t even care that much

about whether the probability is 0.8. You really just care, is it one,

is it in this class, or is it not? And so we’ll define the objective function

for this new output layer in a second. Well, let’s first understand

the feed-forward process. And the feed-forward process is what you

will end up using at test time, and for each element also in training

before you can take derivatives. It’s always feed-forward first and

then backward to take the derivatives. So what we wanna do here is for example, take basically each window and

then score it. And say if the score is high we want to

train the model such that it would assign high scores to windows where the center

word is a named entity location. Such as Paris, or London, or Germany,

or Stanford, or something like that. Now we will often use, and you’ll see in a lot of papers, this kind

of graph, so it’s good to get used to it. There are various other kinds,

and we’ll try to introduce them slowly throughout the lecture but

this is the most common one. So we’ll define bottom up,

what each of these layers will do and then we’ll take the derivatives and

learn how to optimize it. Now x window here is the concatenation

of all our word vectors. So let’s here, and

I’ll ask you a question in a second, let’s try to figure out the dimensionality

here of all our parameters, so that I know you’re with me. So let’s say each of our word

vectors here is four dimensional and we have five of these word vectors in

each window that are concatenated. So x is a 20 dimensional vector. And again,

we’ll define it as column vectors. And then lets say we have

in our first hidden layer, lets say we have eight units here. So you want an eight unit hidden layer

as our intermediate representation. And then our final scores just

again a simple single number. Now what’s the dimensionality

of our W given what I just said? 20 dimensional input, eight hidden units. 20 rows and eight columns. We have one more transpose,

[LAUGH] that’s right. So it’s going to be eight rows and

20 columns, right? And you can always check,

whenever you’re unsure and you have something like this:

this will be some n times d, and then you multiply it with this, and this

will always have d rows, and so these two inner dimensions always

have to be the same, right? So all right, now what’s the main intuition behind this

extra layer, especially for NLP? Well, that will allow

us to learn non-linear interactions between these

different input words. Whereas before, we could only say

well if in appears in this location, always increase the probability

that the next word is a location. Now we can learn things and patterns like,

if in is in the second position, increase the probability of this being the location

only if museum is also the first vector. So we can learn interactions

between these different inputs. And now we’ll eventually make

our model more accurate. Great question. So do I have a second W there. So the second layer here the scores

are unnormalized, so it’ll just be U and because we just have a single U, this will

just be a single column vector and we’ll transpose that to get our inner product

to get a single number out for the score. Sorry, yeah, so the question was

do we have a second W vector. So yeah, that’s in some

sense our second matrix, but because we only have one hidden neuron in

that layer, we only need a single vector. Wonderful. All right, so,
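
The dimensions worked out above can be sketched as a feed-forward pass, assuming NumPy, random stand-in parameters, and a sigmoid as f (the helper names sigmoid and score are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
word_dim, window_size, hidden = 4, 5, 8

x = rng.standard_normal(word_dim * window_size)   # 20-dimensional concatenated window
W = 0.1 * rng.standard_normal((hidden, x.size))   # 8 rows, 20 columns
b = np.zeros(hidden)                              # one bias per hidden unit
U = 0.1 * rng.standard_normal(hidden)             # a single vector for the score

def score(x, W, b, U):
    z = W @ x + b        # shape (8,)
    a = sigmoid(z)       # shape (8,), f applied element-wise
    return U @ a         # inner product: one unnormalized score

s = score(x, W, b, U)
print(W.shape, s)
```

If W were accidentally created as 20 by 8, the product W @ x would raise a shape error, which is a quick way to catch a missing transpose.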

now let’s define the max-margin loss. It’s actually a super powerful loss

function often is even more robust than the cross entropy error in softmax,

and is quite powerful and useful. So let’s define here two examples. Basically, you want to give

a high score to windows, where the center word is a location. And we wanna give low scores to corrupt or incorrect windows where the center

word is not a named entity location. So museum is technically a location,

but it’s not a named entity location. And so the idea for this training objective of max-margin is

to essentially try to make the score of the true windows larger, and the ones of

the corrupt windows smaller, or lower, until they’re good enough. And we define good enough as being

different by the value of one. And this one here is a margin. You can often see it as

a hyperparameter too and set it to m and try different ones but

in many cases one works fine. This is continuous and

we’ll be able to use SGD. So now what’s the intuition behind the

softmax, sorry the max-margin loss here? If you have for

instance a very simple data set and you have here a couple

of training samples. And here you have the other class c,

what a standard softmax may give you is a decision

boundary that looks like this. It’s like perfectly separates the two. It’s a very simple training example. Most standard softmax

classifiers will be able to perfectly separate these two classes. And again, this is just for

illustration in two dimensions. These are much higher

dimensional problems and so on. But a lot of the intuition

carries through. So now here we have our decision

boundary and this is the softmax. Now, the problem is maybe that

was your training data set. But your test set, actually,

might include some other ones that are quite similar to those stuff you saw

at training, but a little different. And now this kind of decision

boundary is not very robust. In contrast to this, what the max margin loss will attempt to do is to

try to increase the margin between the closest points

of your training data set. So if you have a couple of points here and

you have different points here. We’ll try to maximize the distance between the closest points here, and

essentially be more robust. So then if at test time you have some

things that are kinda similar, but not quite there, you’re more likely

to also correctly classify them. So it’s a really great loss or

objective function. Now in our case here we have just one sc, for

one corrupt window. In many cases in practice we’re

actually going to have a sum over multiple of these. And you can think of this similar to the

skip-gram model where we sample randomly a couple of corrupt examples. So you really only need for

this kind of training a bunch of true examples of this

is a location in this context. And then all the other windows

where you don’t have that as your training data are essentially

part of your negative class. All right, any questions around

the max-margin objective function? We’re gonna take a lot of

derivatives of it now. That’s right, is the corrupt

window just a negative class? Yes, that’s exactly right. So you can think of any other window that

doesn’t have as its center location just as the other class. All right, now how do we optimize this? We’re going to take very similar steps to

what we’ve done with cross entropy, but now we actually have this hidden layer and

we’ll take our second to last step towards the full back-propagation algorithm

which we’ll cover in the next lecture. So let’s assume our cost

J here is larger than 0. So what does that mean? In the very beginning you will initialize

all your parameters here again. Either randomly or maybe you’ll initialize

your word vectors to be reasonable. But they’re not gonna be quite perfect at

learning in this context in the window what is location and what isn’t. And so in the beginning all your scores

are likely going to be low cuz all our parameters, U and W and b have been

initialized to small, random numbers. And so we’re unlikely going to be great

at distinguishing the window with a correct location at center

versus one that is corrupt. And so basically,
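
The max-margin objective described above can be sketched in a few lines of plain Python. This is a minimal illustration with made-up scores; the function name max_margin_loss and its margin parameter are assumptions, with the margin defaulting to the value of one used in the lecture:

```python
def max_margin_loss(s_true, s_corrupt, margin=1.0):
    """J = max(0, margin - s_true + s_corrupt) for one true/corrupt pair."""
    return max(0.0, margin - s_true + s_corrupt)

# Early in training both scores are small and similar, so the cost is positive.
early = max_margin_loss(0.02, 0.05)

# Once the true window outscores the corrupt one by at least the margin, the cost is zero.
late = max_margin_loss(5.0, 2.0)

print(early, late)
```

With several sampled corrupt windows, the same function would simply be summed over each pair.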

we will be in this regime. After a while of training, eventually

you’re gonna get better and better. And then intuitively

if your score here for instance of the good window is five and

one of the corrupt is just two, then you’ll see 1 - 5 + 2 = -2, which is less than 0, so you just basically have 0

loss on those elements. And that’s another great property of

this objective function which is over time you can start ignoring more and more

of your training set cuz it’s good enough. It will assign 0 cost as in 0 error to these examples and so

you can start to focus on your objective function only on the things that the model

still has trouble to distinguish. All right, so let’s in the very

beginning assume that for most of our examples J will be larger than 0. And so what we’re gonna have to do now

is take derivatives with respect to all the parameters of our model. And so what are those? Those are U, W, b and our word vectors x. So we always start from the top and then

we go down because we’ll start to reuse different elements and just the simple

combination of taking derivatives and reusing variables is going to

lead us to back propagation. So derivative of s with respect to U. Well, what was s? s was just u transpose times a and so we all know that derivative

of that is just a. So that was easy, first element,

first derivative super straight forward. Now it’s important when we

take the next derivative to also be aware of all our definitions. How we define these functions that

we’re taking derivatives of. So s is basically U transpose a,

a was f(z) and z was just Wx + b. All right,

it’s very important to just keep track. That’s like almost 80% of the work. Now, let’s take

the derivative like I said, first partial of only one

element of W to gain intuitions. And then we can put it back together and

have a more complex matrix notation. So we’ll observe for

Wij that it will actually only appear in the ith activation of our hidden layer. So for example, let’s say we have a very

simple input with a three dimensional x. And we have two hidden units,

and this one final score U. Then we’ll observe that if we take

the derivative with respect to W23. So the second row and

the third column of W, well that actually only is needed in a2. You can compute a1 without using W23. So what does that mean? That means if we take

the derivative of weight Wij, we really only need to look at

the ith element of the vector a. And hence, we don’t need to look

at this whole inner product. So what’s the next step? Well as we’re taking derivatives with W,

we need to be again aware of where does W appear and all the other parameters

are essentially constant. So U here is not something

we’re taking a derivative of. So what we can do is just take it out,

just as like a single number, right. We’ll just get it outside,

put the derivative inside here. And now, we just need to very

carefully define our ai. So a subscript i, so

that’s where Wij appears. Now, ai was this function,

and we defined it as f of zi. So why don’t we just

write this carefully out, and now this is first application

of the chain rule with derivative of ai with respect to zi,

and then zi with respect to Wij. So this is single application

of the chain rule. And in the end it looks kind of

overwhelming, but each step is very clear. And each step is simple, we’re really

writing out all the gory details. So application of the chain rule,

now we’re going to define ai. Well ai is just f of zi, and f was just an

element-wise function on a single number zi. So we can just rewrite ai with

its definition of f of zi, and we keep this one intact, all right? And now derivative of f,

we can just for now assume is f prime. Just a single number, take derivative. We’ll just define this as f prime for now. It’s also just a single number,

so no harm done. Now we’re still in this part here, where we basically wanna take

the derivative of zi with respect to Wij. Well let’s define what zi was,

zi was just here. The W of the ith row times x

plus the ith element of b. So let’s just replace zi

with its definition. Any questions so far? All right, good or not? So we have our f prime and

we have now the derivative with respect to Wij of just

this inner product here. And we can again,

very carefully write out well, the inner product is just this

row times this column vector. That’s just the sum over k of Wik times xk, and now when we

take the derivative with respect to Wij, all the other Ws are constants. They fall out, and so

basically, of all the xk, the only one that actually appears

in the sum with Wij is xj, and so basically this derivative is just xj. All right, so now we have this

whole expression, just from carefully applying the chain rule, multiplications,

definitions of all our terms and so on. And now basically, what we’re gonna want

to do is simplify this a little bit, cuz we might want to

reuse different parts. And so we can define, this first term here

actually happens to only use subindices i. And it doesn’t use any other subindex. So we’ll just define Uif prime of zi for all the different is as delta i. At first notational simplicity and

xj is our local input signal. And one thing that’s very helpful for you to do is actually look at also the

derivative of the logistic function here. Which can be very conveniently computed

in terms of the original values. And remember f of z here, or f of zi of each element is

always just a single number. And we’ve already computed it

during forward propagation. So we wanna ideally use hidden activation

functions that are very fast to compute. And here, we don’t need to compute

another exponent or anything. We’re not gonna recompute f of zi cuz

we already did that in the forward propagation step. All right, now we have the partial derivative

here with respect to one element of W. But of course, we wanna have the whole

gradient for the whole matrix. So now the question is,

with the definitions of this delta i for all the different elements of

i of this matrix and xj for all the different elements of the input. What would be a good way of trying to

combine all of these different elements to get a single gradient for the whole

matrix W, if we have two vectors. That’s right. So essentially, we can use delta

times x transpose, namely the outer product to get all the combinations

of all elements i and all elements j. And so this again might seem

like a little bit like magic. But if you just think again of

the definition of the outer product here. And you write it out in terms of all

the indices, you’ll see that turns out to be exactly what we would want in

one very nice, very simple equation. So we can kind of think of this delta

term actually as the responsibility of the error signal that’s now arriving from

our overall loss into this layer of W. And that will eventually

lead us to flow graphs. And that will eventually lead us to you

not having to actually go through all this misery of taking all these derivatives. And being able to abstract it

away with software packages. But this is really the nuts and

bolts of how this works, yeah? Yeah, the question is, this outer product

will get all the elements of i and j? And that’s right. So when we have delta times x transposed. Then now we have basically here,

x is usually this vector. So now let’s take the right notation. So we wanna have derivative

with respect to W. W was a, 2×3 dimension matrix for

example, 2×3. We should be very careful of our notation. 2×3. So now,

the derivative of J with respect to our W has to, in the end, also be a 2×3 matrix. And if we have delta times x transposed,

then that means we’ll have to have a two-dimensional delta, a 2×1 column vector, which is

exactly the dimension of the error signal that is coming in, [INAUDIBLE] the same

dimension that we have for the number of hidden units that we have, times this one dimensional,

basically row vector x transposed, which is the 1×3 dimensional

vector that we get by transposing x. And so, what does that mean? Well, that’s basically multiplying now,

standard matrix multiplication. You should write that. So now the last term that we haven’t

taken derivatives of, [INAUDIBLE], is our bi, and

it’ll eventually be very similar. We’re going to go through it. We can pull Ui out, we’re going to

take f prime, assume that’s the same. So now, this is our delta i. We’ll observe something very similar. These are very similar steps for bi. But in the end, we’re going to

just end up with this term and that’s just going to be one. And so,

the derivative of our bi element here, is just delta i and we can again

use all the elements of delta, to have the entire gradient for

the update of b. Any questions? Excellent, so this is essentially,
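
As a sanity check on the derivation above, here is a minimal sketch for the small example with a three dimensional x and two hidden units. It assumes NumPy, a sigmoid as f, and random stand-in values, and it compares the analytic gradients of the score s with a central finite difference (the names delta, grad_W, grad_b and numeric_W23 are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # three dimensional input
W = rng.standard_normal((2, 3))     # two hidden units
b = rng.standard_normal(2)
U = rng.standard_normal(2)

def score(W, b):
    return U @ sigmoid(W @ x + b)

# Analytic gradients: delta_i = U_i * f'(z_i), with f'(z) = f(z) * (1 - f(z)) for the sigmoid.
a = sigmoid(W @ x + b)              # already available from the forward pass
delta = U * a * (1.0 - a)
grad_W = np.outer(delta, x)         # delta times x transposed, a 2x3 matrix
grad_b = delta                      # the derivative of z_i with respect to b_i is 1

# Central finite difference for W23 (row 2, column 3, so index [1, 2]).
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric_W23 = (score(W_plus, b) - score(W_minus, b)) / (2 * eps)

print(grad_W[1, 2], numeric_W23)
```

The two printed numbers should agree to several decimal places; the same check can be repeated for any other entry of W or b.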

almost back-propagation. We’ve so far only taken derivatives and

used the chain rule. And the first time I went through this, a lot of the magic of deep

learning just became a lot clearer. We’ve just taken derivatives, we have

an objective function and then we update based on our derivatives, all

the parameters of these large functions. Now the main remaining trick, is to re-use

derivatives that we’ve computed for the higher layers in computing

derivatives for the lower layers. It’s very much an efficiency trick. You could not use it and it would

just be very, very inefficient to do. But this is the main insight of why we re-named taking

derivatives as back propagation. So what are the last derivatives

that we need to take? For this model, well again,

it’s in terms of our word vectors. So let’s go through all of those. Basically, we’ll have to take the

derivative of the score with respect to every single element of our word vectors. Where again, we concatenated all

of them into a single window. And now, the problem here is that each word vector actually

appears in both of these terms. And both hidden units use all of

the elements of the input here. So we can’t just look at a single element. We’ll really have to sum over, both of the

activation units in the simple case here, where we just have two hidden units and

three dimensional inputs. Keeps it a little simpler,

and there’s less notation. So then, we basically start with this. I have to take derivatives with

respect to both of the activations. And now, we’re just going to go

through similar kinds of steps. We have s. We defined s as u transpose

times our activation. That was just Ui times ai, then ai

was just f of the ith row of W times x plus bi, and so on. Now, what we’ll observe as we’re going

through all these similar steps again is that, we’ll actually see the same

term here reused from before. It’s Ui times f prime of zi. This is exactly the same that we’ve seen here, Ui times f prime of zi. And what that means is,

we can reuse that same delta. And that’s really one of the big insights. Fairly trivial but very exciting,

cuz it makes it a lot faster. But what’s still different now is that of course we have to take

the derivative, with respect to xj, of this inner product

here, where we basically dropped the bias term, cuz that’s just a constant

when we were taking this derivative. And so, this one here again,

with respect to xj, is just one element of the inner product: it’s the jth element of the ith row of this matrix

W, Wij, that’s the relevant one for this inner product

when we take the derivative. So now we have this sum here, and

now comes again this tricky bit of trying to simplify this sum into something

simpler in terms of matrix products. And again, the reason we’re getting

towards back propagation is that we’re reusing here these previous error signals,

and elements of the derivative. Now, the simplest, the first thing we’ll

observe here as we’re doing this sum, is that sum is actually also a simple inner

product, where we now take the jth column. So again, with this dot notation,

when the dot is after the first index, we take the row, and

here, where the dot comes first, we take the column. So it’s a column vector. But then of course we transpose it, so it’s a simple inner product for

getting us a single number. Just the derivative of this element of

the word vectors and the word window. Yes. Great question. So once we have the derivatives for all these different variables, what’s

the sequence in which we update them, and there’s really no sequence we

update them all in parallel. We just take one step in all the elements

that we now had a variable in or have seen that parameter in. And the complexity there,

is in standard machine learning you’ll see in many models just like

standard logistic regression, you see all your parameters like

your W in all the examples. And ours, it’s a little more complex,

because most words you won’t see in a specific window and so, you only update

the words that you see in that window. And if you assumed all the other ones,

you’d just have very, very large, quite sparse updates, and that’s not

very RAM efficient, great question. So now we have this simple

multiplication here, and the sum is just an inner product. So far so simple, and we have our delta,

a vector which, as I mentioned, has two dimensions. We have the sum over two elements. So, so far so good. Now, really, we would like to get the full

gradient here with respect to all xj, for j equals one to three in

this simple case, or 5d if we have a five

word large window. So now the question is, how do we

combine this single element here into a vector that eventually gives us all

the different gradients for all the xj, for j equals 1 to however long our window

is? Is anybody following along this closely? That’s right. W transposed delta. Well done. So basically our final derivative and

final gradient here for our score s with respect to the entire

window, is just W transpose times delta. Super simple very fast to implement,

you can easily think about how to vectorize this again by concatenating multiple

deltas from multiple windows and so on. And it can be very efficiently

implemented and derived. All right, now the error message

delta that arrives at this hidden layer has of course the same dimensionality as

its hidden layer because we’re updating all the windows. And now from the previous slides we

also know that when we update a window, it really means we now cut up that final gradient here into the different chunks

for each specific word in that window, and that’s how we update our

first large neural network. So let’s put all of this together again. So, our full objective function

here was this max and I started out with saying let’s assume it’s larger

than zero so you have this identity here. So this is a simple indicator function: if the condition is true,

then it’s one and if not, it’s zero. And then you can essentially

ignore that pair of correct and corrupt windows x and

xc, respectively. So our final gradient when we have these kinds of max margin functions

is essentially implemented this way. And we can very efficiently

multiply all of this stuff. All right. So this is just that, this is not right. This is our [INAUDIBLE] But you still

have to take the derivative here, but basically this indicator function is

the main novelty that we haven’t seen yet. All right. Yeah.>>[INAUDIBLE]>>Yeah, it’s a long question. The gist of the question is how do we make

sure we don’t get stuck in local optima. And you’ve kinda answered it a little

bit already which is indeed because of the stochasticity you keep making updates

anyway it’s very hard to get stuck. In fact, the smaller your,

the more stochastic you are, as in the fewer windows you look at

each time you want to make an update, the less likely you’re getting stuck. If you had tried to get through all the

windows and then make one gigantic update, so it’s actually very inefficient and

much more likely to get you stuck. And then the other observation

that is just slowly coming through in some of the theory, which

we couldn’t get into in this class, is that it turns out a lot of the local

optima are actually pretty good. And in many cases, not even that far away from what you

might think the global optimum would be. Also, you’ll observe a lot of times,

and we’ll go through this in some of the project advice, that in many cases

you can actually perfectly fit. If you have a powerful enough

neural network model, you can often perfectly fit your input and

your training dataset. And you’ll actually, eventually spend

most of your time thinking about how to regularize your models better and often

add even more stochasticity. We’ll get to some of those. But yeah, good question. Yeah, in the end, we just have all

these updates and it’s all very simple. All right, so let’s summarize. This was a pretty epic lecture. Well done for sticking through it. Congrats again, this was our super

useful basic components lecture. And now this window model is actually

really the first one that you might observe in practice and

you might actually want to implement in a real-life setting. So to recap, we’ve learned word vector training,

we learned how to combine windows. We have the softmax and

the cross entropy error, and we went through some of the details there. We have the scores and the max margin loss,

and we have the neural network, and it’s really these two steps here that you have

to combine differently for problem set number one, and

especially number two in that. So, we just have one more
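
To make the recap concrete, here is a minimal sketch of the window gradient and the max margin indicator discussed in this lecture. It is an illustration under stated assumptions, not the course's reference code: the nonlinearity is assumed to be a sigmoid, the score is taken to be s = U^T f(Wx + b), the margin is assumed to be 1, and all function names (score_and_input_grad, split_into_words, max_margin_input_grads) are hypothetical. It uses plain Python lists so it runs without dependencies; in practice you would vectorize this with a matrix library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_and_input_grad(x, W, b, U):
    """Return (s, ds/dx) for s = U^T f(Wx + b), assuming a sigmoid f."""
    # Hidden pre-activations z = Wx + b and activations a = f(z).
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    a = [sigmoid(z_i) for z_i in z]
    s = sum(u_i * a_i for u_i, a_i in zip(U, a))
    # Error message delta_i = U_i * f'(z_i); it has one entry per hidden
    # unit, i.e. the same dimensionality as the hidden layer.
    delta = [u_i * a_i * (1.0 - a_i) for u_i, a_i in zip(U, a)]
    # Gradient with respect to the whole window: W^T delta.
    grad_x = [sum(W[i][j] * delta[i] for i in range(len(W)))
              for j in range(len(x))]
    return s, grad_x


def split_into_words(grad_x, d):
    """Cut the window gradient into one d-dimensional chunk per word."""
    return [grad_x[k:k + d] for k in range(0, len(grad_x), d)]


def max_margin_input_grads(x, x_c, W, b, U):
    """Gradients of J = max(0, 1 - s + s_c) w.r.t. x and x_c."""
    s, grad_s = score_and_input_grad(x, W, b, U)
    s_c, grad_sc = score_and_input_grad(x_c, W, b, U)
    J = max(0.0, 1.0 - s + s_c)
    # Indicator: 1 if the margin is violated, 0 otherwise. When it is 0,
    # this pair of correct and corrupt windows contributes no update.
    indicator = 1.0 if 1.0 - s + s_c > 0 else 0.0
    grad_x = [-indicator * g for g in grad_s]
    grad_xc = [indicator * g for g in grad_sc]
    return J, grad_x, grad_xc
```

For a window of three two-dimensional word vectors, split_into_words(grad_x, 2) gives back three chunks, one per word, which is what you would use to update each word vector in the window separately.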

math heavy lecture and after that we can have fun and

combine all these things together. Thanks.