
Lesson Description
The "Encoder & Decoder Transformers" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve explains encoders and decoders in neural networks, where encoders convert words into numerical vectors, while decoders reconstruct words from these vectors to form sentences. He also discusses decoding strategies like Top-K and Top-p sampling to predict the next word in a sequence based on confidence levels.
Transcript from the "Encoder & Decoder Transformers" Lesson
[00:00:00]
>> Steve Kinney: All right, so we got one more kinda conceptual thing; it's more of an extension of a conceptual thing. And then we can fine-tune some models and it's gonna be dope. So we know about encoders and we know about decoders, right? And decoders are effectively when we think about one word at a time, and we're only looking backwards, right?
[00:00:22]
And we know the encoder takes the words, turns them into vectors, a list of numbers. We capture that meaning and we kind of start to figure this out. The words are gone at that point, right, we've turned them into these IDs. When it's looking at "cat", it has the ID for it; I can almost remember the number that "cat" had, but whatever, doesn't matter.
[00:00:41]
It's now only working with those numbers. It doesn't really care; the neural network doesn't necessarily care about the words. It just literally has these IDs, it's putting them in this multidimensional space, and coming up with the relationships, right? Once they are encoded and they go into the model, the words are gone, effectively.
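As a rough illustration of that encoding step, here is a minimal sketch using a Hugging Face tokenizer (the GPT-2 tokenizer and the sample sentence are just assumptions to make it runnable):

```python
# A minimal sketch: turning words into the token IDs the model actually sees.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # assumed model; any tokenizer works similarly

ids = tok("The cat sat on the mat")["input_ids"]
print(ids)                              # just a list of integers; the words are gone
print(tok.convert_ids_to_tokens(ids))   # the sub-word pieces those integers stand for
```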
[00:01:01]
So then we need the part where, okay, I need to figure out what the next word might be. So the decoder takes that and then tries to decode it back into a sequence of words. The goal is to produce a meaningful sentence that actually makes sense, which, as we saw in some of those early examples, we didn't always get.
[00:01:22]
The best way to think about it is a zip file: encoding is effectively zipping up all your data, and then decoding is unarchiving all of that data. Ultimately those numbers will also be smaller than the words that they represent, and they can be handed off to a GPU.
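And a small sketch of the unzip direction, continuing with the same assumed GPT-2 tokenizer: tokenizer.decode maps the IDs back to text, while the decoder model is the part that predicts which IDs should come next.

```python
# Going back the other way: token IDs -> text.
text = tok.decode(ids)
print(text)   # "The cat sat on the mat"
```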
[00:01:38]
The GPU has no idea about words. It's not familiar with this concept of words. It's been mining crypto all day. It doesn't know what words are. And so then obviously coming back out is how we deal with that stuff. And so, as we said before, encoder-only is BERT. Getting inside the transformer itself, for us, getting the words in and out is where we have encoders and decoders.
[00:02:01]
But in the actual model itself, our GPTs, what we're trying to move towards now, when we go to fine-tune GPT-2, is effectively only looking backwards. It's not looking forward; that's the autoregressive piece, right? BERT is obviously just looking for the context around everything. And this is the difference.
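One way to picture that "only looking backwards" behavior is the causal attention mask; this is a hedged sketch in PyTorch, not the course's own code:

```python
# A causal mask: position i may attend to positions 0..i, never to later ones.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# Each row is True up to the diagonal and False after it, so token i never
# "sees" the tokens that come after it. A BERT-style encoder skips this mask
# and lets every token attend to the whole sentence in both directions.
```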
[00:02:27]
We can actually see this in ChatGPT, where, for every token, it will know about the previous ones, but it will nerf the next ones to the point where there's no relation at all. So it can only look backwards. So we start with some kinda prompt, and the model predicts probabilities for the next token.
[00:02:48]
The decoding strategy selects one token, and that token's added to the input, then we move on to the next token, right? So you can almost see that when you use ChatGPT, and that streaming is partly there so you do not have to wait forever to see a response.
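Here is a hedged sketch of that loop with GPT-2 from Hugging Face (greedy selection stands in for the decoding strategy, and the prompt is made up):

```python
# prompt -> predict next-token probabilities -> pick one -> append -> repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tok("The cat sat on the", return_tensors="pt")["input_ids"]

for _ in range(10):                                  # generate up to 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits             # a score for every word in the vocab
    probs = torch.softmax(logits[0, -1], dim=-1)     # only the last position predicts what's next
    next_id = torch.argmax(probs)                    # greedy "decoding strategy", for simplicity
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    if next_id.item() == tok.eos_token_id:           # the model decided it's done talking
        break

print(tok.decode(input_ids[0]))
```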
[00:03:02]
But effectively it's making stuff up as it goes along statistically, right? It's saying, like, okay, there's a token that separates your prompt from the beginning of its response, and it's going to guess what the first word it should say is, and then it's going to guess the next word, and it's going to guess the next word, until it decides what it should actually generate next.
[00:03:26]
Like, actually, the next word is to stop talking, right? The next word is, I'm done speaking now, right, that was probably the next thing to do. And some of that's informed by what we saw, a max token length or something like that, right? In a lot of ways, the sense of, we're gonna hit that max token limit soon, start winding this down, does play into the statistical model at some point as well.
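As a sketch of those stopping signals, continuing with the GPT-2 setup above, generate() stops either when the end-of-sequence token is produced or when the token budget runs out, whichever comes first:

```python
out = model.generate(
    input_ids,
    max_new_tokens=40,               # hard cap on how long the reply can get
    eos_token_id=tok.eos_token_id,   # the "I'm done speaking now" token
    pad_token_id=tok.eos_token_id,   # GPT-2 has no pad token, so reuse EOS
)
print(tok.decode(out[0]))
```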
[00:03:51]
We talked about temperature, which plays in here as well; that's how much spiciness we want. And we can actually see that statistically. And we referenced this top-K and top-p, which will all play in as well, which is like, how do we figure out the next word? Cuz otherwise it's always going to be the same word every time, right, you don't get a lot of nuance.
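Those knobs map directly onto generate() arguments; this is a sketch with made-up values, continuing the same GPT-2 setup:

```python
out = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,    # sample instead of always taking the single most likely token
    temperature=0.8,   # below 1.0 sharpens the distribution, above 1.0 adds spiciness
    top_k=50,          # only consider the 50 most likely next tokens
    top_p=0.95,        # ...and only the smallest set covering 95% of the probability
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0]))
```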
[00:04:12]
And you'd almost go down the same rabbit hole a little bit, because if a sentence started with the same three words, you'd always end up in the same place. So how do these two things work? So pure sampling: if you just said, hey, pick any word, that could end poorly for you.
[00:04:34]
Any word. Now, top-K sampling is to pick from, say, the 50 most likely next words, go pick one of them. But we saw that sometimes we had a situation, in those earlier examples at the very beginning, where the confidence level was 0.99 for one and then like 0.01 for the next one, right?
[00:05:03]
So that even if you said, I'm gonna pick from the top 50, the drop-off after the fifth could be huge. And you get garbage words after that, right? And then top-p is like, okay, pick from the most likely words until their confidence levels add up to 90%.
[00:05:22]
Or add up to 95%. Or 85%. Some of this is tweaking the knobs and seeing, right? And you can set both of these, right? But generally speaking, top-p is the one you probably want, right? Cuz, like, you're saying, I don't know if there are, like, five words that I feel confident about or 500, but, like, I should feel pretty confident about the next word, right?
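To make the difference concrete, here is a small from-scratch sketch on two toy, already-sorted distributions (all numbers are made up for illustration):

```python
import torch

def top_k_filter(probs, k):
    # Keep the k most likely tokens, however unlikely the k-th one is.
    return probs[:k]

def top_p_filter(probs, p):
    # Keep the smallest set of most-likely tokens whose probabilities reach p.
    cumulative = torch.cumsum(probs, dim=0)
    cutoff = int((cumulative < p).sum().item()) + 1
    return probs[:cutoff]

flat   = torch.tensor([0.30, 0.25, 0.20, 0.12, 0.08, 0.05])    # lots of plausible next words
peaked = torch.tensor([0.96, 0.01, 0.01, 0.01, 0.005, 0.005])  # one obvious next word

print(top_k_filter(flat, 3), top_k_filter(peaked, 3))      # always exactly 3, even the 0.01 junk
print(top_p_filter(flat, 0.9), top_p_filter(peaked, 0.9))  # 5 candidates vs. just 1
```

Top-K keeps a fixed-size list no matter how lopsided the probabilities are, while top-p shrinks or grows the list with the model's confidence.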
[00:05:48]
That's usually the one you want, unless you know you have a reason to want a different one.