Lesson Description

The "Tokenization Overview" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:

Steve discusses the concepts of tokenization, encoding, and decoding in the context of AI models. Tokenization involves breaking down text into smaller units, converting them into numeric representations, and analyzing their relationships in multi-dimensional space. Encoding translates tokens into numbers, while decoding reverses this process to reconstruct the original text.

Transcript from the "Tokenization Overview" Lesson

[00:00:00]
>> Steve Kinney: So we've spent a lot of time talking about all the things you can do and ways you could do them, but we didn't really get into how any of that works, right? Which is the thing we need to understand before we can fine-tune a model and bend it to our whim. Now we know the things you can do with a model.

[00:00:17]
That's neat. We've been hand-wavy about it; I think we sort of know how they work. Let's actually get into how they work, and then we can weaponize that for our own nefarious purposes. There are kind of two things we need to talk about. One is this idea of tokenization, and then we'll get into encoding and decoding, and then attention.

[00:00:41]
A lot of the models we see today in the current AI hype cycle are based on a paper that came out of Google in 2017 called "Attention Is All You Need," which introduced this idea of Transformers, right? So we are going to start with tokenization and work all the way through attention and Transformers.

[00:01:03]
We'll do a little playing around in some other notebooks to explore this stuff, but we'll build that conceptual framework, and then we'll weaponize it to figure out how to fine-tune a model to do the thing we want, right? Because as I said earlier, and I will say three more times in the next hour, sometimes a really big, general-purpose model that can do almost anything is really good.

[00:01:27]
Sometimes what you want is a tiny, lightweight, low-resource model that you can fine-tune super easily. Because the bigger the model, the harder it is to fine-tune, right? The smaller the model, the easier it is to tune. And we'll look at even better ways to tune it.

[00:01:43]
But first we need to understand how everything we just saw works under the hood, so that we can then leverage it. The first of those topics is this idea of tokenization. You've probably heard the word token at some point by now.

[00:02:00]
And basically, tokenization is the process of taking some big wad of text and breaking it up into smaller pieces, right? And those aren't always just words, right? Sometimes we break words themselves into pieces. Like if you look at the word can't: there are kind of two pieces to can't, can and not, right?

[00:02:25]
And that second part works with isn't too, doesn't it? Or wasn't, and a lot of other words. So in those cases, you break those kinds of terms up into maybe two tokens, right? Or take tokenization, for instance.

[00:02:39]
Token and ization might be two different tokens. So to say that a token is a word is true in a lot of cases, but not always, right? And it depends a lot on the model and the tokenizer and things along those lines, because different models choose to do things differently.
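
As a rough sketch of what that looks like in practice, here's the Hugging Face transformers tokenizer for GPT-2 splitting a couple of those contractions; the splits shown in the comments are indicative and can vary by tokenizer and version:

```python
from transformers import AutoTokenizer

# Load GPT-2's byte-pair-encoding tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["can't", "isn't", "wasn't"]:
    print(word, "->", tokenizer.tokenize(word))

# Expected to print something like (exact splits vary by tokenizer):
#   can't  -> ["can", "'t"]
#   isn't  -> ["isn", "'t"]
#   wasn't -> ["wasn", "'t"]
```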

[00:02:59]
Common words that appear all the time, even if they fall into one of those categories we just described, might actually still be one token. The answer is: it depends. But the idea is that we take these things and break them up into tokens, and these tokens have numeric representations, right?

[00:03:15]
And then we do math to find out how they're related to each other in multidimensional space. And you're like, I'm a liberal arts major. Me too, so don't worry about it, it'll be fine. So the process of tokenization is that we break text down into smaller pieces: words, subwords, sometimes even single characters, right?

[00:03:32]
And then we kind of convert each one of them into some kind of number. We have some special tokens, which might mark the start or end of a sentence, or other meaningful things. In the case of a model trained for chat, there are special tokens that separate the assistant from the user from the system prompt. You don't ever see them, but they are in there to draw invisible lines of demarcation between sections, right?
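
To make that concrete, a small sketch: every Hugging Face tokenizer exposes its special tokens, and chat-tuned models ship a chat template that wraps messages in those invisible marker tokens. The chat model name here is just an assumption for illustration; any chat-tuned checkpoint with a template would do:

```python
from transformers import AutoTokenizer

# A BERT-style tokenizer has sentence-level special tokens
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert.special_tokens_map)
# e.g. {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#       'cls_token': '[CLS]', 'mask_token': '[MASK]'}

# A chat-tuned model's tokenizer knows how to wrap a conversation in
# role markers (assumed model name for illustration)
chat = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(chat.apply_chat_template(messages, tokenize=False))
# The printed string includes special marker tokens around each role's message
```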

[00:04:02]
And then we also figure out various ways to determine the importance of different words. Because a lot of how this is gonna work, not to ruin the surprise for you, is about any given word and its relation to the words around it, right?

[00:04:17]
And there are some words that are related to lots of other words, like the, which shows up around a lot of words. A lot of words have relationships to the word the, just not particularly important relationships. Sometimes, maybe, but not always. And so then we do various things to figure out, one, what is the relationship of that word to the other words in the sentence?

[00:04:43]
And then two, how strong are those relationships? And given those relationships, what could that word mean? I will use this on a later slide, but one of the examples is the concept of going down to the banks of the Mississippi River versus robbing a bunch of banks tomorrow, right?

[00:05:00]
Same word. Context matters, right? And we as humans just have brains that happen to do that, right? Computers don't, so we had to figure out how to teach them to. And that's effectively the higher-level concept that we're going to talk about. Like I said, there are various different tokenizers.

[00:05:23]
You do not need to memorize this; there will not be a quiz on it later. Just to say, for some of the models we saw earlier in our journey together: the vocabulary is how many tokens the tokenizer knows about, right? And then, how does it do the splitting?

[00:05:41]
That part doesn't really matter on a practical, day-to-day basis. All you really need to know is that they're different, right? They have different vocabularies. And obviously, you can imagine GPT-2 has a smaller vocabulary than GPT-4, right? All of those things matter to a certain extent, but not necessarily in our day-to-day usage.
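
If you want to see those vocabulary differences for yourself, here's a quick sketch; the sizes in the comments are what these checkpoints report (roughly 50k for GPT-2, 30k for BERT):

```python
from transformers import AutoTokenizer

# Compare vocabulary sizes across a few tokenizers
for name in ["gpt2", "bert-base-uncased", "distilbert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tokenizer.vocab_size} tokens")

# gpt2: 50257 tokens
# bert-base-uncased: 30522 tokens
# distilbert-base-uncased: 30522 tokens
```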

[00:06:03]
Just knowing that you might see differences across the different models and have to change your approach is the important part. Does it affect your day-to-day life? Most days it does not. Cool, so then we split things into tokens. Whatever library you use will have something like a tokenize method, which you can treat as pseudocode at this point, that takes a given piece of text and splits it into these tokens.

[00:06:31]
So this, I think it's GPT-2's tokenizer, if I'm remembering right from when I made the slides, is how it would break tokenization apart into two tokens. And you can kind of see that there is at least a little bit of a hint that one of these is a suffix, right?

[00:06:50]
I'll let you guess which one is the suffix. And then you can figure things out from the pieces, because you don't want to just figure out tokenization's relationship to other words, right? Sometimes you want to know what token means on its own. It's like swimming and swim: swim is probably important context for the other words around it.

[00:07:07]
You don't want a totally different representation for swimming than you have for swim. Swam would technically be its own token, but whatever. The idea is that you break words apart so the pieces are meaningful. So a token is effectively a piece of a word.
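
Here's a sketch of that suffix hint with GPT-2's tokenizer: in its vocabulary, the Ġ character marks a token that begins with a space, so a piece without it, like ization, is glued onto the previous token as a suffix. The exact split in the comment may vary by tokenizer version:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("I love tokenization"))
# Something like: ['I', 'Ġlove', 'Ġtoken', 'ization']
# 'Ġ' marks a leading space; 'ization' has none, so it's a suffix piece
```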

[00:07:24]
What's the heuristic if you're just trying to guess how many tokens something is? A token is roughly 0.75 of a word, or about four characters. That's just a heuristic; it's not exact, because there are small words, there are big words, some words get broken up into two, and so on and so forth. But it's a close-enough approximation.
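
Since the tokenizer itself is the ground truth, here's a quick sketch that sanity-checks the heuristic against a real count; the numbers you see will depend on the text:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Sometimes a tiny, lightweight model is all you need."
n_words = len(text.split())
n_tokens = len(tokenizer.encode(text))

print(f"{n_words} words -> {n_tokens} tokens "
      f"(heuristic guess: ~{round(n_words / 0.75)})")
# Word counts and token counts track each other closely, but not exactly
```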

[00:07:41]
The point is, we break apart words, we turn them into numbers, and we have these special tokens. So then encoding is the process of taking those tokens, which map to numbers, and, like, the same word will always map to effectively the same number.

[00:08:01]
And we have these special tokens. Is it the same word or the same token? Same token, good catch. Absolutely: the same token, not the same word, will map to the same ID. At this point, these special ones have known IDs. The classification token goes at the beginning of a sequence.

[00:08:19]
The separator token separates the segments: two different sentences will have a separator in between them. These start to give the model a sense of shape and the ability to figure out where things stop and end, and how this stuff works. In an overly simplified version of building your own ChatGPT-like model, there will be special tokens that say: okay, here's the beginning of the assistant message, the end of the assistant message, the beginning of the user message, and so on and so forth.
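
Here's a sketch of those known IDs using BERT's tokenizer, where the classification token [CLS] opens the sequence and [SEP] closes each segment; the IDs in the comments are bert-base-uncased's:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a pair of sentences inserts [CLS] at the start and [SEP]
# between and after the segments
ids = tokenizer.encode("Hello world", "How are you?")
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
# [CLS] is ID 101 and [SEP] is ID 102 in this vocabulary:
# ['[CLS]', 'hello', 'world', '[SEP]', 'how', 'are', 'you', '?', '[SEP]']
```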

[00:08:48]
Things that act as lines of demarcation between sections, and that are tokens in and of themselves, right? But it all becomes a set of numbers. So, for instance, you might see something like "Hello, World!": you go to encode it, and you end up with the IDs of all of those tokens.

[00:09:06]
If you then were to decode it, you lose a little bit of meaning. Most obviously, you lose the capitalization, because this vocabulary doesn't treat Hello and hello as different words. In some cases, depending on the vocabulary, all caps do mean something different than the lowercase version. But generally speaking, we break it apart, and we get, first of all, the addition of the special tokens.

[00:09:31]
But then, you know, the comma and the exclamation point are also tokenized. And we can see that if we take a given string of text, encode it, and then immediately decode it, we gain some stuff and we lose some stuff along the way, right?
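
A round-trip sketch of exactly that: encode a string, decode it right back, and compare what survives. The output shown is for bert-base-uncased, which lowercases everything:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

original = "Hello, World!"
ids = tokenizer.encode(original)      # string -> token IDs
round_trip = tokenizer.decode(ids)    # token IDs -> string

print(ids)         # e.g. [101, 7592, 1010, 2088, 999, 102]
print(round_trip)  # "[CLS] hello, world! [SEP]"
# We gained the special tokens and lost the capitalization
```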

[00:09:46]
But that's effectively showing how this process normalizes a bunch of those words. So encoding turns text into numbers. You can take a lucky guess at what decoding does: decoding turns it back into a string. Because, sure, your prompt goes in and gets turned into a bunch of numbers, and stuff happens to figure out what the next set of tokens should be, in terms of numbers.

[00:10:11]
But if you just typed something into ChatGPT and got back a bunch of numbers, you'd be like, well, I'm never using this again. So there is a process of putting it back together, and that is decoding. And different models use it differently: you can imagine the ones that look forward and backward engage with this differently than the ones that only look backward and try to generate the way forward.
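
To see decoding at the end of the pipeline, here's a small sketch with GPT-2, a backward-looking (causal) model: the prompt is encoded to IDs, the model predicts more IDs, and decode turns them back into text. The exact continuation you get will vary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt -> token IDs
input_ids = tokenizer.encode("The banks of the Mississippi", return_tensors="pt")

# The model predicts the next token IDs (numbers in, numbers out)
output_ids = model.generate(input_ids, max_new_tokens=10)

# Token IDs -> human-readable text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```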

[00:10:38]
Is this where model watermarking for outputs happens too, when they're decoding to print to the screen? Basically, yes. Are those proprietary libraries, or are they published as well? It depends on whether it's an open-source model or not, right? As far as I know, GPT-2 obviously is open; I think GPT-3 had a model card, and then at some point OpenAI became less open about AI.

[00:11:06]
So in this case, you can take a series of token IDs and decode them back into text. No surprises here.
