
Lesson Description
The "Encoding & Decoding Text" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve explores tokenization using different models like BERT, GPT2, RoBERTa, and T5, showcasing how each tokenizer breaks down text into tokens differently. He also discusses the process of encoding and decoding text, highlighting the uniqueness of each model's. tokenization approach
Transcript from the "Encoding & Decoding Text" Lesson
[00:00:00]
>> Steve Kinney: If we go back to the site that we had earlier, this is our second little notebook here on tokenization, so we can actually see this with real code. There are some of the words that I said earlier, and we can get a sense of them because we can actually grab a bunch of different models off Hugging Face.
[00:00:23]
And we just ask them, like, yo, what's your vocab size? Right? And so I grabbed BERT, GPT-2, RoBERTa, and T5, and I basically went through all of them and said, how many tokens do you have? Right? And this AutoTokenizer. Before, when we saw pipeline, it was all of the pieces of the chain, right?
[00:00:50]
The tokenizer, the model weights, everything. In this case I'm just grabbing the tokenizer. So I'm using this AutoTokenizer, which also grabs whatever other accoutrements it needs so I'm not wiring everything up myself by hand, but it's not the full pipeline in this case. And you can see that it is going to pull those down for me right now.
[00:01:11]
It's got my libraries, we're pulling those in. We'll have them momentarily as we spin up our little disk, so on and so forth. Here they come. You can see it pulling down the vocab.json and all the fun files, and here we go. It's almost like that slide I showed you earlier.
[00:01:33]
I wrote code to figure it out. So here you can see, you can grab all of them. They are different sizes. What that means to you on a practical, day-to-day level is not much, just the fact that models are different. One tokenizer is not the same as another tokenizer.
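(The loop Steve runs looks roughly like the sketch below; the checkpoint names are the standard Hugging Face ones, and the notebook's variable names may differ.)

```python
# Rough sketch of the vocab-size comparison in the notebook.
from transformers import AutoTokenizer

# Standard Hub checkpoint names for the four models mentioned above.
checkpoints = ["bert-base-uncased", "gpt2", "roberta-base", "t5-small"]

for name in checkpoints:
    # AutoTokenizer pulls down only the tokenizer files (vocab, merges, config),
    # not the model weights.
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tokenizer.vocab_size}")
```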
[00:01:47]
They are all usually at least somewhat bespoke to that given model. So we've got those vocabularies, and then what would it look like to start tokenizing things? And so here you can switch between the different tokenizers if you want to see. So we take this string. You can also change the string to whatever you want.
[00:02:11]
We'll hit play. And so first of all it breaks it up into the tokens, and that's what BERT's tokenization looks like. Hello, we got the comma, the world, exclamation point. Tokenization is the yadda yadda yadda. You don't need me to read you a sentence you can read, but one by one in tokens, right?
[00:02:29]
We can do the same thing again if we want to change it to GPT-2. And you can see that GPT-2 has a slightly different way of demarcating the division between things. It's probably just a special character that renders weird as it spits out. The important part here, of course, is that they are slightly different.
[00:02:52]
Right. T5 does it too. Even the way that T5 broke up the words is different. Right. And let's just look at one more to further drive the point home. So RoBERTa just looks a lot closer to GPT-2's. Again, that is most likely a special character that happens to render as that Ġ, a G with a little mark on top.
[00:03:15]
And you can change it out. You can try different things, you can experiment. Like, if we go back to one that looked a little bit better for us. All right, so those count as full words because they're probably common enough. The word tokenization in the real world, when you're not talking about AI models, probably doesn't come up in the body of work nearly as much as a word like swimming does.
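(A sketch of that tokenize step; the sentence here is just an example string, not necessarily the one in the notebook.)

```python
# Same string through each tokenizer: the splits and the marker characters differ.
from transformers import AutoTokenizer

text = "Hello, world! Tokenization is how models read text."  # example string

for name in ["bert-base-uncased", "gpt2", "roberta-base", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(text))

# BERT marks word continuations with "##"; GPT-2 and RoBERTa mark a leading
# space with the byte-level "Ġ" character; T5's SentencePiece uses "▁".
```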
[00:03:39]
The vocabulary is effectively a set of those suffixes and core words. But some common-enough words are going to come up as we go through. You can try out different ones if you're curious how they end up. Anyone else want to try one for funsies? You look like you have one off the top of your head.
[00:03:57]
>> Speaker 3: I have a question.
>> Steve Kinney: All right, let's do it. Let's switch to a question. What do you got?
>> Speaker 3: I'm curious, is there a case where you'd ever go between one tokenized format and a different tokenized format? Or is best practice always to go from text to tokens?
>> Steve Kinney: I mean, most of the time on the way into a model, as you notice, these tokenizers look like model names.
[00:04:16]
Right? Right. And so they are. Usually you cannot use tokens from BERT with GPT, right? So no. Is there ever a reason? I'm not gonna say there's never a reason, but no, there's probably not a reason.
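(A quick sketch of why that doesn't work: the same IDs index completely different vocabularies.)

```python
# Token IDs from one tokenizer are meaningless to another tokenizer.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

ids = bert.encode("Hello, world!", add_special_tokens=False)
print(bert.decode(ids))   # round-trips back to "hello, world!" (BERT lowercases)
print(gpt2.decode(ids))   # same numbers, different vocabulary -> unrelated tokens
```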
>> Speaker 2: What does it do with something that's not in its vocabulary?
[00:04:33]
Like if you just put in a made up word or just like ASDF or something.
>> Steve Kinney: So this one will probably still get the whole word. I wonder when we get to the decoding part, if we actually get a value for it, if we end up with some.
[00:04:47]
That's interesting.
>> Speaker 4: That is interesting.
>> Steve Kinney: That's fascinating. There are certain things that you never think about until you're teaching and somebody asks you a question like that, and I'm like, I don't know. Now what I want to do is stop teaching for the day and go play around with a bunch of decoders.
[00:05:01]
I'm not going to do that right now. But I'm probably going to do that tonight. That's going to live rent free in my head. Yeah, go.
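(If you want to poke at that question right away, a sketch like this shows what happens to a made-up word such as "asdf" from the question above; the exact splits depend on the tokenizer.)

```python
# A made-up word doesn't usually hit the unknown token; it gets chopped into
# subword pieces that do exist in the vocabulary.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

print(bert.tokenize("asdf"))   # something like ['as', '##d', '##f']
print(gpt2.tokenize("asdf"))   # byte-level BPE also falls back to smaller pieces
```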
>> Speaker 3: Does a higher token count typically mean it's a better model?
>> Steve Kinney: Sure. You know, it's like one of those things where it's like, does being bigger and having more parameters make it a better model?
[00:05:21]
Usually. But not always, right? It is a not-wrong heuristic that I'm sure there are tons of counterexamples for. You know what I mean? So, possibly. But keep in mind that the bigger the model, the harder it is to fine-tune. You know what I mean?
[00:05:41]
Yes-ish. The problem with computer science is that nothing is easy, everything is hard, and most rules have an exception. That's really annoying. And then from tokens to IDs. So here we've got one from earlier. I'm actually going to... I'll probably try to remember to change this later.
[00:06:01]
If I do @param and I do type string. I know, I know, I'm getting there. I can change it here really easily. Could it have been just as easy to change the code? Absolutely, but here we are. We can run this through, where we can see it turn into those numbers.
[00:06:22]
Which now raises the question of what happens with my garbage word. I need to break this up; I need to have all these things in one place so I can see the transformation all the way through, because those are probably those weird pieces that we saw earlier.
[00:06:45]
I don't know. This is gonna live rent-free in my head. I don't have an answer for you, but I will, unfortunately, to everyone in my family's chagrin, probably have one shortly, because I'm gonna be thinking about that while I talk now. But to get back to the core point from before, because as you can see I'm still spending cycles on this: obviously we split apart the words, we turn them into numbers.
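(The tokens-to-IDs step is roughly this sketch; convert_tokens_to_ids is the standard transformers call, and the printed IDs are just illustrative.)

```python
# Each token string maps to an integer index in the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Hello, world!")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # ['hello', ',', 'world', '!']
print(ids)     # the matching integers, e.g. [7592, 1010, 2088, 999]
```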
[00:07:11]
Ideally, the final step is to turn those numbers back into words, right? Actually, hold on. I think I still have the way I did this, because they're all in memory. So I've got the token IDs; those, I think, still have the garbage word in them. And so then what happens?
[00:07:30]
It did manage to put it back. My guess is that at a certain point the individual letters are in there.
>> Speaker 3: And certain pairs of letters get put together as a single token for whatever reason.
>> Steve Kinney: That would be my guess. My guess is that at some level, individual letters are tokens.
[00:07:48]
There are enough of the common characters. Yeah, let's go back up and look. Yeah, because in this case, just the L, and the idea that it had stuff before it. These probably got broken up into tokens in those vocabs. I probably can't do much with this.
[00:08:07]
They're probably all related to almost nothing in that sequence. But I think at a certain point there are small enough, granular pieces in there, because if you think about it, individual letters only get you to 26, plus probably enough of the common combinations, because a lot of times it did fall back to a single letter.
[00:08:24]
Out of the 50,000, those probably don't take up that much space. Right. So we all learned something together today. And that's why having these notebooks is super cool, though. The idea that you can just say, I don't know, let's run the code and see, is super fun with these Colab Jupyter notebooks as we go through.
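(The decode step that answered the question looks roughly like this; the made-up word survives the round trip because the vocabulary bottoms out at single characters and common letter pairs.)

```python
# Round trip: text -> IDs -> text. The garbage word comes back intact because
# it was built out of subword pieces, not an unknown token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("Hello asdf world")       # adds [CLS] and [SEP] by default
print(ids)
print(tokenizer.decode(ids))                     # "[CLS] hello asdf world [SEP]"
print(tokenizer.decode(ids, skip_special_tokens=True))  # "hello asdf world"
```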
[00:08:47]
Cool. So let's see, let's turn this back to my own, just so I don't... I don't remember exactly if having garbage words in there will bite me at some point. So yeah, here we can see we've got that again. That's the beginning token. That is the separator in this case, if I remember correctly.
[00:09:10]
Hello world. I think it will just wrap around the actual text. Again, yeah, it does. The separation is just for that given thing. What you think about is how the prompts in something like ChatGPT are then separated: this is one thing. And then, you know, again, in the chatbot ones, they are trained with special tokens to delineate what role it was.
[00:09:37]
But it kind of shows the start and end of a given thing, which, when we do the fine-tuning, we'll see roughly where a quote should start and end, and stuff along those lines, and the format of it. Look, apparently previous me took notes. If you truly found a way to put something in there that it does not know at all, you will get a special token for an unknown vocab word.
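(You can see those special tokens, including the unknown-vocab one, directly on the tokenizer.)

```python
# The special tokens BERT's tokenizer uses: sequence start, separator, padding,
# and the unknown-word token Steve mentions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.cls_token, tokenizer.sep_token)   # [CLS] [SEP]
print(tokenizer.pad_token, tokenizer.unk_token)   # [PAD] [UNK]
print(tokenizer.all_special_tokens)
```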
[00:10:04]
Cool. And we'll see that might come up sometime later, because with these, when we try to match two of them up is where we put the padding in, right? So it'd be interesting to see. I'm pretty sure you can visualize it; I just have to get creative later and see if I can come up with a way to visualize it.
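(Padding shows up as soon as you tokenize two sequences of different lengths together; a sketch:)

```python
# Batching two sentences of different lengths: the shorter one is padded out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Hello, world!", "A much longer sentence that needs quite a few more tokens."],
    padding=True,
)
for ids in batch["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids))   # shorter row ends in [PAD] tokens
```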
[00:10:19]
>> Speaker 4: So I just went to look up how many words are in the English language, and the entry from Merriam-Webster says it's hard to estimate, but it could be roughly a million, right? And these tokenizers have like 50,000 tokens.
>> Steve Kinney: Yeah. So it'd be curious. I don't know if we know the token counts of, like, Claude Opus or whatever. These are... if someone wants to live-research the release date of GPT-2, you know what I mean?
[00:10:48]
Which was back when none of us were saying AI is changing everything, with GPT-2, you know what I mean? These are also several-hundred-megabyte models, versus, I think we saw on Hugging Face, some of them are... I have models on my computer that are 46 gigabytes. My suspicion is a 46-gigabyte model most likely has a larger vocabulary, and those are open source.
[00:11:14]
We could probably figure it out; it's just that probably nobody wants to watch me download a 46-gigabyte model right now. That said, well, not on the plane, but at some point that'd be interesting to see, because there are large open-source models. Just because we can't see inside Claude Opus or Gemini 2.5 Pro does not mean we can't pull down, like, Meta's models.
[00:11:37]
Their models are open source, right? So you can grab, like, Llama 3, the 105-billion-parameter one. I mean, I can't run that on my laptop, unfortunately, but I'd be curious to see what its vocab size is.
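(If you do want to check, the tokenizer alone is a small download; a sketch, assuming you've accepted Meta's license on the Hub and logged in, since the Llama repos are gated.)

```python
# The tokenizer files are tiny compared to the weights, so you can inspect a
# big model's vocabulary without downloading the model itself.
# Assumes access to the gated repo (accept the license, then `huggingface-cli login`).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
print(tokenizer.vocab_size)   # Llama 3's vocabulary is much larger than GPT-2's ~50k
```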
[00:11:55]
>> Speaker 4: Have you played with DeepSeek, like...
>> Steve Kinney: Like on a consumer level, do you know what I mean? Have I downloaded DeepSeek and spun it up and talked to it? Yes. Right. I haven't gone down into the weeds with it. I've done the I-chat-with-it,
[00:12:10]
it-reasons, it-gives-me-answers level of it, but not on a let's-pull-it-down-and-pull-it-apart level, right? But some of the larger open-source ones are somewhat interesting to do that with, right? Where you can pull down a 20-gigabyte model and, for one, load it up.
[00:12:28]
There are some apps like LM Studio or whatever that pull models down from Hugging Face, and they'll give you a desktop app that looks like your average Electron app, and you can load the models up and tweak a lot of the parameters.
[00:12:42]
Like how many CPU cores do you want to let it use, what is the temperature, and top-k and top-p, which we'll see in a second. And you can chat with them. It'll give you the chat interface, and it'll actually put an OpenAI-compliant API layer on top of them too.
[00:12:58]
So you can actually take any of these larger 20-, 40-, 60-gig models; again, the size of your RAM and/or GPU will be the limiting factor, not the space on your computer. A model that is maybe 8 gigabytes on disk is still going to take up something like 32 gigs of RAM, depending on how you tune it. But those would be interesting to pull apart as well.
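(Talking to one of those locally served models looks like talking to the hosted API; a sketch using the openai client, where the base_url is LM Studio's default local port and the model name is a placeholder for whatever you've loaded.)

```python
# Chatting with a local model through an OpenAI-compatible server (e.g. LM Studio).
# The port and model identifier below are assumptions -- check your own server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",   # placeholder; use the identifier your server reports
    messages=[{"role": "user", "content": "What is tokenization?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```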