Open Source AI with Python & Hugging Face

Decoding Strategies for Text Generation

Steve Kinney
Temporal

Lesson Description

The "Decoding Strategies for Text Generation" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:

Steve discusses decoding strategies, such as greedy decoding and temperature adjustments, demonstrating how these impact the model's output creativity. He also touches on the importance of end-of-sequence tokens to control model output length and the distinction between generative and extractive text generation approaches.


Transcript from the "Decoding Strategies for Text Generation" Lesson

[00:00:00]
>> Steve Kinney: So in this example, what I'm going to do is honestly just create five random tokens, because it doesn't really matter what they are. And we're going to see how it creates a little chart for me that shows how the model thinks about the five tokens as it processes each one.

[00:00:20]
Which is going to go about how you'd expect, right? When it is on token one, how many tokens do you think the decoder is going to care about? The answer showed up, right? So at this point, we have random tokens that don't mean anything, with a causal attention mask where we basically zero out all future tokens, right, as we saw before.

[00:00:47]
When the first token is processed, it only cares about the first token. It's only considering the first token in that understanding, right? It basically just zeroes every other token out of consideration. It does not care. So at first, we only care about the first token.

[00:01:06]
And then we come up with a second token. When we try to come up with a third token, we only care about the first two. BERT will look in both directions. A GPT is only going to look backwards to figure out how to predict, but then it will use all the tokens that were predicted previously to inform the next one, along with all of your prompt.
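
A minimal sketch of the causal mask that chart is visualizing, assuming PyTorch; the token values are made up because, as above, they don't matter:

    import torch

    seq_len = 5  # five random tokens

    # Causal (lower-triangular) mask: position i may attend to positions 0..i only.
    mask = torch.tril(torch.ones(seq_len, seq_len))
    print(mask)
    # tensor([[1., 0., 0., 0., 0.],
    #         [1., 1., 0., 0., 0.],
    #         [1., 1., 1., 0., 0.],
    #         [1., 1., 1., 1., 0.],
    #         [1., 1., 1., 1., 1.]])
    # Token one only sees itself; token three sees only the first three, and so on.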

[00:01:25]
When we think about, I said this earlier, when we think about ChatGPT or Claude, we think: I ask a question, it gives me an answer. It must be question answering. No, it's basically taking your entire text, drawing a line of demarcation, looking back at it, guessing the first word it should say, then guessing the second word.

[00:01:43]
And you see those reasoning models, like o3, or extended thinking in Claude, where they will spend a certain amount of time adding context themselves before they spit out the answer, right, reflecting on it. But that's all just text generation getting put onto the stack, effectively. Then all the thinking that it did, right, is getting stuck on there as well.

[00:02:07]
And so it's using all that thinking to predict what the next word should be, right? And that's just a level of sophistication, because all it ever works on is: given all of the tokens up until now, what should the next one be? Now we've got GPT-2 medium.

[00:02:28]
Pull that down, and you can see there's that vocab JSON again. You'll notice that before we were using GPT-2; now we grab GPT-2 medium. So models also come in sizes. If you go grab an open source LLM like Llama 3, there is an 8 billion parameter version and a 70 billion parameter version.

[00:02:57]
And Llama 3.1 has a 405 billion parameter version, don't quote me on this, that I can't run on my machine, so on and so forth. So you can get not only different models in terms of generation, you can obviously get different sizes of models as well. Cool, cool, cool.
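
Pulling down the bigger checkpoint is the same call as before, just with a different name. A sketch using the Hugging Face transformers API, assuming the notebook's setup:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Only the checkpoint name changes ("gpt2" -> "gpt2-medium").
    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

    # GPT-2 has no pad token by default, so reuse the end-of-sequence token.
    tokenizer.pad_token = tokenizer.eos_token
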
>> Male Student 1: I have a question just about running models.

[00:03:15]
So we've been talking about different parameter sizes. Is this confined to actual physical memory or can it do some swapping on faster virtual memory?
>> Steve Kinney: I think for most things, you can swap something between RAM and GPU memory. But for the most part, it's got to be loaded in some sort of memory.

[00:03:34]
Yeah, it's tricky because on a Mac, I don't even have two kinds of memory, right?
>> Male Student 1: I'm just thinking, even if it's faking it and writing it to disk, [CROSSTALK].
>> Steve Kinney: Yeah, I don't think it can page to disk at all, right? I think the entire model has to be in memory.

[00:03:50]
Yeah, the tricky part is, when I've run these things on my machine, a Mac does not separate GPU RAM from system RAM. So I always forget in this case as well. But where are we going with this?
>> Steve Kinney: All right, so we've pulled in the pad token.

[00:04:17]
We've pulled in our model at this point. So there are a bunch of decoding strategies, and this is where we can start to play; I don't need all of this right here. We've got the greedy decoding strategy: just pick the single most likely next token. It's fast and predictable.
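
A minimal sketch of that greedy run, continuing with the tokenizer and model loaded above; the prompt is illustrative, not necessarily the notebook's exact wording:

    prompt = "The future of artificial intelligence is"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding: do_sample=False always takes the single most likely next token.
    output = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))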

[00:04:33]
But we will get mostly the same thing every time. The other thing we can do is start to play with that temperature, right? Let's actually run all of these and just see some stuff in here. We'll let all that go for a second. So here I'm having it run multiple times.

[00:04:52]
We saw this at the beginning. This is a temperature of roughly zero at this point. It's just greedily taking that immediate next one. I have a max token limit of 20, and so it did stop at one point. We can bump that up to maybe 22, given this particular answer.

[00:05:18]
>> Steve Kinney: Give that a second. We will likely see two of the exact same sentences when it comes out. Yeah, right, and so there is no guessing at this point. You will always get the same thing every time, because it is picking the most likely next word.

[00:05:38]
Now, you can have some questions on why those are the next words. I cannot answer that one for you. But statistically speaking, that's what GPT-2 says. Did anyone look up when GPT-2 dropped? It might have been happier times, yeah. You will get effectively the same output. And so then we have that temperature.

[00:06:02]
We accidentally saw this a little bit earlier as well when we played around with it, where we will get slightly different ones. But this graphs it out for you. So for all of the words, what was my initial beginning one? What did I write here? A lot of that's the graphing stuff.

[00:06:19]
I think it's that, yes, "the future of artificial intelligence is." So a lot of this code, if you're wondering what all of this is, it's drawing the pretty graphs. Don't worry about the code. It's just drawing the pretty graphs. So you can see, for each temperature level, what it thinks its choices of the next word should be.

[00:06:47]
At a very low temperature, it doesn't have a lot of options, right, and not a lot of creativity there. As the temperature goes up, the number of words it's willing to consider and sample from to be the next word goes up, right? And so that's where that creativity comes from, versus always getting the same thing.
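
Under the hood, temperature just divides the logits before the softmax. A toy sketch with made-up scores, assuming PyTorch:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([4.0, 2.0, 1.0, 0.5])  # made-up scores for four candidate words

    for temp in (0.1, 0.7, 1.5):
        # Lower temperature sharpens the distribution; higher flattens it.
        probs = F.softmax(logits / temp, dim=-1)
        print(temp, probs)
    # At 0.1, nearly all of the probability piles onto the top word;
    # at 1.5, the tail words become live options to sample from.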

[00:07:08]
Turning that knob down is almost like turning the overall volume down, so a word has got to be really loud to get through. And as you turn the temperature up, you let in more noise, right? And so the higher the temperature, the more words are statistically available for it to pick.

[00:07:28]
Which will either get you more creative answers or complete garbage, right? And that's why it's a dial, right, that you can set as well. By default, you usually see 0.5. If we do it at 0.7, you get something a little bit better than what we saw originally at the lowest setting.
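
In generate terms, that 0.7 run is roughly this sketch, continuing with the inputs from the greedy example; the output it produced follows:

    output = model.generate(
        **inputs,
        do_sample=True,       # sample instead of always taking the top token
        temperature=0.7,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))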

[00:07:50]
"The future of artificial intelligence is a wider, more complicated topic. What kind of services and capabilities will be needed to support the growing demands of AI in many areas of everyday life?" Yeah, it's GPT-2. So it still reads like those tests that you took in third grade, the reading comprehension standardized tests.

[00:08:06]
It's not necessarily an interesting read, but you can kind of see, with the graphing, how its choices of the next word tend to come up. And then we have top K and top P, which goes by the name nucleus sampling. And again, I will remind you that these are all notebooks where you can go in here.

[00:08:24]
If you want to change that to a different number, or you want to turn on the top K, you can go ahead and try out all of these things and kind of see how the differences will show back up.
>> Steve Kinney: The fun part is you never really know what you're gonna get with these things.
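
If you want to turn those knobs yourself, the arguments look roughly like this; 0.92 and 60 are just the example values being played with here:

    # Nucleus (top-p) sampling keeps the smallest set of words whose
    # probabilities sum to top_p; top_k caps the candidate list by count.
    output = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.92,   # try turning this down, e.g. to 0.1
        top_k=60,     # or lean on top_k alone by setting top_p=1.0
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))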

[00:08:42]
"The future of artificial intelligence is like a trap. Its greatest threat isn't foreseeable. It's between you and," it says, "Jon Goodman." I think John Goodman, the actor, has an H in his name, though, so it's not totally wrong on that one. But one of the things that jumps out at me that'd be interesting to do is to take that output and throw it into that named entity recognition pipeline and figure out what comes out.

[00:09:08]
The combination of things always becomes the most interesting. But yeah, what would happen if we just turned this down to 0.1, right? We don't give it as many options. I wonder if we end up with something very close to what we saw in the very beginning, right?

[00:09:31]
That's very much like Bart Simpson writing on the board, right? But again, it's just where we turn these knobs to get the outputs we want to get. Like, what happens if we turn this down? It was, I don't know, like 0.92 originally, right? And then we'll turn that down and we'll turn up the top K to kind of view the difference.

[00:09:57]
Now we're a lot more positive.
>> Steve Kinney: And it's interesting once you kind of know that every previous word predicts the next one. That is almost what makes it beautiful, which is that it will continue on whatever weird tangent it decided to go on, to a certain extent, right? It has made up a possibly real person, possibly at a real company.

[00:10:25]
And it will continue with that, because each word informs the next word. Right, and you do have a corpus of previous words that makes it at least more likely to stay on track, until it starts blowing through its context window and forgets the original words. Because if it gets too long, it will start dropping the tokens at the beginning, and that's when the hallucinations happen.

[00:10:43]
But like most lies, as soon as you start down the path, you're pretty good at lying, until you've lied for so long that you've forgotten the original basis, and that's when your lie falls apart. The same is true with statistics and AI. Let's go turn that one back down to zero.

[00:11:00]
And what happens if we, I think that one ends up being an integer, if I'm not mistaken. I might get yelled at, we'll see. This is what happens when you let a JavaScript engineer loose. Yeah, no, that one was pretty boring again. Let's see, let's only give it a top K this time, because I think maybe the top P was also playing into that as well.

[00:11:30]
>> Steve Kinney: That's gonna be, that's a bumper sticker waiting to happen. All right: "I think people should be smarter. They should be smarter than us." Here's your challenge. You wanna make money? This is how you do it. You just get the lightest weight model, and you turn the knobs all the way up.

[00:11:54]
And then you just print Instagram ads for T-shirts, right? Yeah, I think low quality models plus some free time could definitely make you a T-shirt company or a bumper sticker company. If anyone would like a free "how to get rich" idea from this workshop, it's AI generated meme stuff that just makes literally no sense.

[00:12:22]
Yeah, so this is taking any of the top 60 most likely words after the previous ones and going for it: "a research outfit which wants to make us better than our ancestors," right? Because with top K, the 60 most likely words are all fair game, and it will sample based on those.

[00:12:43]
So, yeah, top P is better in a lot of cases. We talk about how models like Gemini 2.5, Claude 4 Opus, and o3 have context windows of hundreds of thousands or even millions of tokens. Our good friend GPT-2 over here has a context window of 1024 tokens, right? And so it will start losing things off the beginning of the stack if it goes on for too long.
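
You can check that limit on the model config; a quick sketch, assuming the gpt2-medium model loaded earlier:

    # GPT-2's maximum context length lives on its config.
    print(model.config.n_positions)  # 1024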

[00:13:08]
At which point, that chain of thought of making stuff up will lose the original seed of it, which could end poorly for you, or it might not. Yeah, and then, like we said before, some of those special tokens come into play, right? And so, in some way, it's guessing another word, right?

[00:13:33]
In a lot of cases, it can plausibly guess, "I should probably stop talking now." Which is a thing, it took me decades. And my wife would argue, one day, maybe, I'm still trying to learn it myself. Which is this idea that it can have an end-of-sequence token where you can actually say, hey, stop.

[00:13:53]
Because again, the tools don't always know to stop. So you have to say, okay. An argument that you can give to this model in the pipeline right now is the end-of-sequence token from this particular tokenizer, for whatever model you're using. If you hover over it, yeah, there's the end-of-sequence token for, am I using GPT-2 medium, I think?

[00:14:17]
The ID of that token is 50256. So when it comes across that token, even if we haven't hit the max tokens, it stops talking, right? And that's a valid token as well. So then we, hopefully, and again it becomes a game, right? We're going to say your max new tokens are 100.
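
Wired into generate, that looks something like this sketch; 100 is the cap from the walkthrough above:

    print(tokenizer.eos_token_id)  # 50256 for GPT-2's tokenizer

    output = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=100,                   # hard ceiling on length
        eos_token_id=tokenizer.eos_token_id,  # stop early when this token is generated
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))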

[00:14:52]
Like, it might, yeah. If we don't come across one, we can jump the max new tokens up to 200. Hopefully, what you would find is that if it hits the max first, before it hits an end-of-sequence token, it never gets to stop on its own.

[00:15:06]
But hopefully, if you give it enough room, it will eventually come across one of those. We could also change the prompt to something a little shorter. But like I said, these are all to be played with as well. But my patience for letting it go is going to run out real fast.

[00:15:36]
>> Steve Kinney: We got a duplicate up there as well. But I also might have a memory issue I gotta look at, like we have the things overriding each other. But yeah, anyway, you can define the end token. And hopefully, you will kind of get a sense of how it will figure out when to stop as well.

[00:16:03]
Now, we talked about this before: the generative approach is making stuff up, and the extractive approach will actually find the answer in the text. Right, those are two very different bases.
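
A hedged sketch of that distinction using two pipelines; the models here are illustrative defaults, not necessarily the ones from earlier lessons:

    from transformers import pipeline

    # Generative: writes new text token by token, and can make things up.
    generator = pipeline("text-generation", model="gpt2-medium")
    print(generator("The future of artificial intelligence is", max_new_tokens=30))

    # Extractive: pulls the answer span out of the text you hand it.
    qa = pipeline("question-answering")
    print(qa(question="How big is GPT-2's context window?",
             context="GPT-2 has a context window of 1024 tokens."))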
