Open Source AI with Python & Hugging Face

Preparing & Loading the Dataset

Steve Kinney
Temporal

Lesson Description

The "Preparing & Loading the Dataset" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:

Steve discusses fine-tuning a model, specifically focusing on preparing a dataset of quotes for training. He emphasizes the importance of selecting a GPU runtime and provides insights into data preparation, model quantization, and utilizing a pre-trained model like GPT-2 medium. Steve also mentions strategies to optimize memory usage, such as quantization, to stay within the free tier limits.


Transcript from the "Preparing & Loading the Dataset" Lesson

[00:00:00]
>> Steve Kinney: So we're gonna go into this fine-tuning notebook and connect. I have too many sessions. You might have gotten this earlier, so I'll show you what to do: Manage sessions. I've opened up many a notebook at this point. Terminate other sessions. We'll go ahead and hit Connect.

[00:00:32]
I'm actually gonna change the runtime, just for all of our sakes, and grab a beefier graphics card. This will probably work in like five or ten minutes on the cheaper one. Again, not bad considering what it is. But I don't wanna wait that long.

[00:00:57]
So we talked a little about this. These are my own notes from before I made the slides. You do need to be on a GPU for this one, so you should at least check: go to Runtime, Change runtime type. You probably only have CPU and T4 GPU.

[00:01:11]
Choose the T4 GPU. I'm gonna sit on an A100 because I'm worth it, because I pay the $10 or $20 a month on this account. And honestly, you're the real beneficiaries of this. So go ahead on this other runtime. I'm just going to make sure we've got everything downloaded.
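If you want to verify that the runtime actually picked up a GPU, a quick sanity check (a sketch assuming PyTorch, which Colab ships with) looks like this:

```python
# Quick check that the Colab runtime has a GPU attached.
import torch

print(torch.cuda.is_available())        # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))    # e.g. "Tesla T4"
```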

[00:01:30]
All right, so first we are going to, quote unquote, find and prepare our data. I grabbed this dataset, which is just quotes from people. So it's like a quote from Oscar Wilde, a quote from Albert Einstein, a quote from Dustin. So we're gonna grab this dataset.

[00:01:48]
If I was nicer to myself, I would have linked to it at the time. But let's go find it. No, I don't want a model, I want the dataset. There it is. And in Hugging Face you can actually kinda see the dataset.

[00:02:17]
So this dataset is effectively a spreadsheet where, you know, it's probably JSON really, but we've got a quote from Oscar Wilde, the author, and some tags I'm not actually gonna use. I'm just gonna use the quotes and whatnot. And what I would say as a challenge to you, which you should 100% do: I grabbed this one 'cause it was there to grab.

[00:02:41]
It seems there are a ton of datasets, so you can literally grab any dataset you want, right? Find a different one; there is no shortage of datasets. Go here, just go to Text. There are 300,000 different datasets. Grab one that makes you happy. You can find out what The Cauldron is, whatever.

[00:03:08]
No idea, never clicked on it. But I grabbed this English quotes one because it was the first one I saw. And I'm gonna put them in this format. I'm gonna say Quote by (author): (quote), and then I'm gonna say end of statement, right? And that's gonna help the model predict.

[00:03:27]
It's the "I should stop talking now" signal, 'cause that is a token that I'm including in there, right? And I've got the dataset. We're going to load the dataset in. You can see I showed you what the first one of the JSON records is: we've got the quote, we've got the author.

[00:03:47]
I don't actually care about the tags; they're there, though. Then for the data preparation, all I'm doing is taking this JSON object and turning it into a string that looks like this. When we say prepare the data, I'm just making a string. That's what I mean by prepare the data.
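A minimal sketch of that preparation step might look like this. The dataset ID `Abirate/english_quotes` is an assumption based on the "English quotes" mention; the template follows the Quote by (author): (quote) format described above:

```python
# Load the quotes dataset and map each record into a single training string.
# NOTE: the dataset ID below is an assumption, not confirmed in the lesson.
from datasets import load_dataset

dataset = load_dataset("Abirate/english_quotes", split="train")

def format_quote(example):
    # {"quote": ..., "author": ..., "tags": ...} -> one formatted string.
    # "<|endoftext|>" is GPT-2's end-of-text token, the "stop talking" signal.
    return {"text": f"Quote by {example['author']}: {example['quote']}<|endoftext|>"}

dataset = dataset.map(format_quote)
print(dataset[0]["text"])
```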

[00:04:07]
We go ahead, we make a string, and effectively we're just going to map over the dataset. I've got to run all my cells, otherwise they're not in memory. That one ran; the little checkmark shows that it ran. That one ran. There we go. So it ran through.

[00:04:33]
No, it's only 2,500; that bigger number is how many it did per second. And so we ran through all those and just formatted them as strings. So actually this is a relatively small dataset, right? I think I was just glancing at it earlier and didn't look hard enough.

[00:04:53]
The 2,500 is the number of records; the 2,805 is just how fast it went. So really we're only taking a relatively small dataset of 2,500 strings, and we are just going to hammer GPT-2 Medium until it gets the point. So we'll pull in GPT-2 Medium, and I have the fancy-pants GPU.

[00:05:21]
I'm doing some quantization. What is quantization? Those numbers in the model can be 32-bit or whatever, and we're gonna say you're all gonna be 4-bit. So you lose some granularity in that giant space of how many knobs you have to turn and stuff along those lines. But you can fit it in smaller amounts of memory.

[00:05:43]
So if you grab an open source model, like a big-boy model, a big-girl model, or a big-person model, like Llama 3 or whatever, and you're like, "I want all the parameters," and your MacBook's like, "absolutely not," you can get it at 16-bit, 8-bit, or 4-bit. You will lose some fidelity, because that's how that works.

[00:06:09]
But you can fit it in memory. So some of this stuff is to keep you on the free tier; I'll bring it down to four bits, so on and so forth. All right, so then I will pull in our model name, which in this case was gpt2-medium, right?

[00:06:26]
Again, I'm not using the full pipeline. I'm just using this "get me the stuff I need for a pre-trained model" call. For the quantization, this is a library called bitsandbytes; it does the quantization. trust_remote_code: it's not my machine. And we'll pull it in.

[00:06:42]
We won't use the cache. Great. This is a method that's just pulling the model and getting it for me. You could just grab the model normally, but if I run this more than once, things go bad. So I'm just getting a fresh one every time, because I was running it more than once last week.
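A sketch of that quantized load, using the standard transformers and bitsandbytes APIs. The exact config values here are assumptions; the lesson only says it goes down to 4-bit:

```python
# Load GPT-2 Medium with 4-bit quantization via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # do the math in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2-medium",
    quantization_config=bnb_config,
    trust_remote_code=True,  # "it's not my machine"
    device_map="auto",
)
```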

[00:06:59]
We don't need a cache. Then we're gonna get the tokenizer for the same model, because we need to turn the text into the same tokens. So a pipeline will get you the tokenizer and the model and do all the encoding and decoding. Here, I've grabbed the model and I've got the tokenizer for it.

[00:07:19]
And if we run this code, we can see what that tokenizer's end-of-statement token is. So I'll actually print it for a second here too. For GPT-2, the end-of-text token, or the end-of-statement token, is this very heavy-metal-looking <|endoftext|>.
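Checking that token yourself is a one-liner (a sketch using the standard transformers tokenizer API):

```python
# Grab the matching tokenizer and print its end-of-text token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
print(tokenizer.eos_token)  # "<|endoftext|>" for GPT-2
```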

[00:07:45]
Each of our sample strings will actually end with that heavy-metal-looking token. And we're going to keep feeding it 2,500 sentences that look like this, and that is going to turn the knobs. So if it sees "Quote by", we've already started to prime the statistical model to go in the direction that we want it to go in.

[00:08:13]
Effectively, we'll go through and we get them all. We've seen all of this. There are my own notes, and we have, again, some of what's going on. This data_collator basically breaks the data up into smaller pieces so that we load it on there in chunks. A lot of this, again, the hardest part of this, was keeping it on the free tier.

[00:08:38]
Honestly.
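For completeness, the collator step described above is most likely the standard DataCollatorForLanguageModeling from transformers (an assumption; `mlm=False` because GPT-2 is a causal language model, not a masked one):

```python
# Batch and pad the formatted strings for causal LM fine-tuning.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # causal language modeling, not masked
)
```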
