Open Source AI with Python & Hugging Face

Batch Processing Multiple Strings

Steve Kinney
Temporal

Lesson Description

The "Batch Processing Multiple Strings" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:

Steve discusses the process of tokenizing and vectorizing text data using PyTorch, including the concepts of padding, truncation, token IDs, attention masks, and vocabulary mapping. He also discusses the model's role in translating text and the structure of vocabulary mapping in tokenization.


Transcript from the "Batch Processing Multiple Strings" Lesson

[00:00:00]
>> Steve Kinney: So then we talked a little bit about what we do with multiple strings, and how we handle "hello world" versus "hello world today," so on and so forth. We can see that padding. So we've got a batch of texts. Let me make sure I run all the various cells in this case, because if I don't run them, they won't run.

[00:00:24]
I'm just talking and not paying attention. That's actually Markdown. So we've got batch_encoding. We take a tokenizer and an array of text, or a list if we're speaking in proper Python parlance, and we say that we do want padding and then some amount of truncation if the text is very long.

[00:00:39]
This "pt" is just what kind of tensors do you want? We're using PyTorch. There's also TensorFlow and a few other ones, but just to make a decision, we chose to use PyTorch today. I will be very honest about where that decision came from. A year, a year and a half ago, when I was researching which one to use, because choice is hard, there were a bunch of people saying TensorFlow is for if you already have it, and PyTorch is for if you're starting a new project.

[00:01:08]
I did not endorse that message, but I read it and then started using PyTorch. Now you do too, if you're running these. Cool, so if we have "The cat sat on the mat" and "The cat sat," we should see the cat sat, pad, pad, pad. So we can actually go ahead and take a look at this in practice and play around with it.
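
As a rough sketch of what that batch-encoding cell does (the checkpoint name below is an assumption; any BERT-style tokenizer from the Hugging Face Hub behaves the same way):

    from transformers import AutoTokenizer

    # The exact checkpoint is an assumption; the lesson just needs a BERT-style tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    batch_of_texts = ["The cat sat on the mat.", "The cat sat."]

    # padding=True pads the shorter sequence, truncation=True trims anything past
    # the model's maximum length, and return_tensors="pt" asks for PyTorch tensors.
    batch_encoding = tokenizer(
        batch_of_texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )

    print(batch_encoding["input_ids"].shape)       # e.g. torch.Size([2, 9]) for these two sentences
    print(batch_encoding["attention_mask"].shape)  # same shape: one mask value per token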

[00:01:29]
Okay, so let's take a look at what we have. So first we're gonna show the input IDs. Those are the tokens. So we have "The cat sat" and "The cat sat on the mat." And so you can pretty much see we've got that [CLS] beginning-of-sequence token, and then "the cat sat" are the same.

[00:01:48]
Then, if you look, that is basically the period in both sentences, then the extra three words, and then the end-of-sequence token. What it does to make them effectively equal is pad the shorter one, right, with a bunch of zeros.

[00:02:14]
So now they are effectively the same length, so you can look at the relation. And then you do have the actual attention mask. Well, that one has all the words; this one obviously has some empty space in there. So it tells you which tokens are actually important and which ones are not, which is used to inform the attention later on and where we should actually put those things together.
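
Continuing from that batch_encoding, inspecting the two tensors looks roughly like this; the exact IDs depend on which tokenizer you loaded, but the trailing zeros and the matching 1/0 mask are the point:

    # Both rows start with [CLS] and contain [SEP]; the shorter one is padded
    # out with 0s, the [PAD] token for BERT-style tokenizers.
    print(batch_encoding["input_ids"])

    # The attention mask marks real tokens with 1 and padding with 0, so the
    # model knows which positions to ignore later on. For these two sentences
    # it should look something like:
    # tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
    #         [1, 1, 1, 1, 1, 1, 0, 0, 0]])
    print(batch_encoding["attention_mask"])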

[00:02:38]
So we can take any given string, and you can see that there are consistent numbers for every given one of these tokens. We've gotta find one that shows us every step of the way through, which didn't occur to me until that question. I thought about the happy path.

[00:03:01]
I did not think about the unhappy path. But luckily these notebooks are something you can play around with as well, and kind of move around and change. So I definitely encourage you to do them. They're all shared, so you can easily just make your own copies and play around with them.

[00:03:14]
And you absolutely should. But for the happy path, let's just run that again real quick. We can see that the tokens do have unique IDs. There is this idea of keeping track of which ones were meaningful tokens and which ones weren't. You might be like, wouldn't the zeros get you most of the way there?

[00:03:28]
You could argue that. But I think also having which ones are actually meaningful in a 1/0 format is powerful as well. Cool, and so effectively that is the process of tokenizing and vectorizing text and stuff along those lines. Yeah, awesome.
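
If you want to see every step of the way through, the tokenizer will show you each stage separately. A minimal sketch, assuming the same tokenizer as above:

    text = "The cat sat on the mat."

    # Step 1: split the string into subword tokens.
    tokens = tokenizer.tokenize(text)
    print(tokens)  # e.g. ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

    # Step 2: look up each token's ID in the vocabulary.
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)

    # Step 3: and back again, IDs to tokens, to confirm the mapping is consistent.
    print(tokenizer.convert_ids_to_tokens(ids))
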
>> Male: So you're trying to guess at kind of which one translates to which character.

[00:03:44]
Is there a scenario where you look up a dictionary and try to understand if something's translated wrong?
>> Steve Kinney: No, that's the model's job.
>> Male: I figured it just-
>> Steve Kinney: That's the model's job, right? Yeah, ideally, right. Especially because it's effectively a mapping. I wonder, at the time I was clearing out all those progress bars because they were taking up a lot of room.

[00:04:07]
But you can see the files that are downloaded. I'm not entirely sure where Colab is storing everything; we could probably find it. But one of the things you could see it pulling down was a vocab.json. My suspicion is that there's something interesting in there. I'm just not entirely sure where in this Linux file system that gets pulled down to.

[00:04:35]
You can probably go into Hugging Face and look at the models too. It is most likely just a hash map. So I feel pretty confident that it's not gonna mess that up.
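
You don't actually have to dig through Colab's file system to find that vocabulary file; the tokenizer exposes the same hash map directly. A small sketch, again assuming the tokenizer from earlier:

    # get_vocab() returns the token -> ID mapping that the downloaded
    # vocab.json (or vocab.txt, depending on the model) contains.
    vocab = tokenizer.get_vocab()

    print(len(vocab))           # vocabulary size, on the order of 30,000 for BERT-style models
    print(vocab["cat"])         # the same ID that shows up in input_ids
    print(tokenizer.unk_token)  # the unknown token, e.g. '[UNK]'
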
>> Male: Does that mean that the JSON will expand as you add new characters?
>> Steve Kinney: I mean, it's the mapping of a token to the ID.

[00:04:56]
Right.
>> Male: So there'll never be a token created that it doesn't know how to process?
>> Steve Kinney: I mean, that's been our current thought exercise, right? We've seen the unknown token; we haven't figured out how to get to one, right? But there's a whole bunch of weird ASCII characters that we have not gone down the dark road of. Like, what will I be doing later today?

[00:05:13]
I'll be holding the Option key and the Shift key and putting in crazy stuff like that Apple symbol, trying to see if I can get something crazy to happen, right? Do I feel the need to do that all in front of you for the next hour? I feel like I shouldn't.

[00:05:31]
I feel like I should, but I know that I shouldn't. And Dustin will look at me if I go on any longer and redirect me. So we won't. But the larger-scale, high-level thing is that we break the words apart into tokens.

[00:05:50]
The tokens become numbers. We'll do something with those numbers momentarily to figure out meaning, and then we will string them all back together when it comes time for a response, right? So what happens when you have too many like-minded people in the room as you start going down the road of mad science?
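
As a rough end-to-end sketch of that loop, text to tokens to numbers and back to text (same assumed tokenizer as above):

    text = "The cat sat on the mat."

    # Words become token IDs...
    ids = tokenizer.encode(text)

    # ...and decode() strings them back together when it's time for a response.
    print(tokenizer.decode(ids, skip_special_tokens=True))

    # Anything the vocabulary has never seen falls back to the unknown token.
    print(tokenizer.tokenize("🦖"))  # likely ['[UNK]'] for a BERT-style tokenizer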
