
Lesson Description
The "Fill-Mask with BERT" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:
Steve discusses the differences between models like GPT and BERT, with BERT focusing on bidirectional context. He demonstrates how BERT's fill mask function predicts missing words based on surrounding context.
Transcript from the "Fill-Mask with BERT" Lesson
[00:00:00]
>> Steve Kinney: Another one we talked about is, again, as I said before, that text generation is interesting because, and we'll see this visually in a little bit with the text generation, it's only looking at the past and making up the future as it goes along, just like most of us.
[00:00:13]
Fill-mask is an interesting one. BERT acts differently than GPT. BERT's job is to look to the left and look to the right, right? And use both. So fill-mask is effectively: instead of coming up with what the next word is, you're in the middle.
[00:00:28]
You've got the words that come after, you've got the words that come before. Looking in both directions, what is the most likely fill-in-the-blank? It's not necessarily Mad Libs, because that would just be random, but this is trying to statistically guess what the right Mad Lib is.
[00:00:46]
It's looking at the entire Mad Lib before you give your nouns and verbs. The one thing to be mindful of is, depending on the model, what they use as the mask. This is the token that means fill in the blank. You do need to look at the docs for whichever one you're using and figure out what the right one is.
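For example, BERT-style models use [MASK] while the RoBERTa family uses <mask>. A minimal sketch of checking that from the tokenizer itself, rather than hard-coding the token (the checkpoint names here are the standard Hub ones, not necessarily the ones used in the lesson):

```python
from transformers import AutoTokenizer

# Different model families use different mask tokens, so read it off the
# tokenizer instead of hard-coding "[MASK]".
for name in ["bert-base-uncased", "roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.mask_token)

# bert-base-uncased -> [MASK]
# roberta-base -> <mask>
```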
[00:01:06]
Here are some strings that we can use. Again, I purposely chose the default models for all of these because, one, they are small, and two, they are what you would get if you didn't specify something. It seemed like the responsible thing to do. And so obviously we can have one at the end.
[00:01:22]
Or is that actually not the end of the sentence? Because that period is a token, but it gets filled in given the context of the sentence. And again, when we get to attention and transformers in a little bit, we'll actually see how this works under the hood. But right now, let's just see it in practice.
[00:01:37]
First: "Python is a popular blank language." "Machine learning is a subset of blank intelligence." "The blank of gravity was discovered by Newton." And "Shakespeare wrote the play blank," right? And so it will use the context of the words around it, which is not what GPT does, right? It's what BERT does.
[00:02:02]
Why someone didn't make a model called Ernie, I don't know. So we'll run this real quick as it downloads the model. How big is this? 331 megs. And so at this point we have it giving a bunch of guesses as it goes along. Where I said top_k is three, that's why you will see three guesses, and you're like, what is top_k?
[00:02:23]
There are slides. They're coming, I promise you. But in this case, the good-enough-for-right-now answer is that we're asking for the three most likely choices. And for these very factual ones, we can change that argument, because for a very factual one you probably don't want to give it a lot of temperature and guesswork on what the capital of France is, right?
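In code, the demo as described looks roughly like this: the fill-mask pipeline with top_k=3, using the pipeline's default model as in the lesson. (A sketch; the exact default checkpoint depends on your transformers version, so the mask token is read off the tokenizer rather than hard-coded.)

```python
from transformers import pipeline

# Default model for the fill-mask task, as in the lesson.
fill = pipeline("fill-mask")
mask = fill.tokenizer.mask_token  # e.g. "[MASK]" for BERT, "<mask>" for RoBERTa

sentences = [
    f"Python is a popular {mask} language.",
    f"Machine learning is a subset of {mask} intelligence.",
    f"The {mask} of gravity was discovered by Newton.",
    f"Shakespeare wrote the play {mask}.",
]

for sentence in sentences:
    # top_k=3 asks for the three most likely fills, each with a score
    for guess in fill(sentence, top_k=3):
        print(f"{guess['score']:.4f}  {guess['sequence']}")
    print()
```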
[00:02:52]
The thing it was the most confident about by an order of magnitude was Paris. But if you tell it to keep going, it's wrong.
>> Speaker 2: How do you find the right type of model for what you're trying to do?
>> Steve Kinney: Yeah, so if you go in there, you're looking for the right type of model, right?
[00:03:08]
A lot of times, for what you're trying to do, whether it's text generation or whatever, there are 42 different tags or whatever, right? They are classified, tagged in some way, shape, or form. Tasks is the term they're using. It doesn't feel like the right term, but tasks is the term.
[00:03:31]
So if you are just looking for all of the zero-shot classification models, you can go into Hugging Face, you go into the models, you go into the task, and you hit it: all right, here are the ones that have been labeled as good at, or tuned for, zero-shot image classification.
[00:03:49]
You can also pick, like, hey, I want to do text-to-text, or zero-shot image classification. If I wanted to do, I think there was a question answering one, document question answering, you will get the models that are suited for that. And then, like most libraries, it becomes a game of what vanity metrics you are interested in, right?
[00:04:10]
There are some very real vanity, well, not vanity metrics. The real metrics of, like, how big is it, how many parameters does it have, when was it updated? Then you start playing the game of downloads and likes, right? In the same way, if I am looking for what npm package I'm going to use, sometimes it is: open them all up on npmjs.org, and whichever has the most downloads and has been updated recently is the one that I try first.
[00:04:34]
But yeah, to the question of how do I find other models: a lot of times they have been labeled and tagged for a given, I guess, task is the right word after all. It doesn't feel very fulfilling as an answer, but allegedly. You go in there, you can find what you're looking for, and then you play a game of whether you want to sort by trending, most downloads, most likes, recently updated, so on and so forth, and kind of do that piece. After that, size.
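You can do that same filter-by-task, sort-by-downloads flow programmatically. A small sketch using the huggingface_hub client (assuming a recent version of the library; the task tag and sort field shown here follow the Hub's public listing API):

```python
from huggingface_hub import list_models

# List the five most-downloaded models tagged for the fill-mask task,
# mirroring the "pick a task, sort by downloads" flow on the website.
for model in list_models(filter="fill-mask", sort="downloads", direction=-1, limit=5):
    print(model.id, "-", model.downloads, "downloads")
```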
[00:05:13]
And again, if nobody's using it, especially in a world where anyone can pull down a model, fine-tune it, and push it back up, there's nothing wrong with that, but you probably want the base ones. In a lot of cases, I think you can usually see if it is forked from another model, so on and so forth. Yeah, the model tree: you can see what models it comes from, so on and so forth.
[00:05:28]
So this is one that is pretty popular. No, not that one per se. But you can see where they're forked from, so on and so forth. But then it becomes, yeah, like I said before, size obviously matters. The parameter count is going to drive the memory footprint, and memory footprint is going to be the limiting factor before download size is ever going to be the thing that hurts you.
[00:05:47]
And then it becomes the usage, and the way you would evaluate any library. But task is apparently the right answer to that question; it just doesn't feel very fulfilling. Cool, so, yeah, we've got the various things that it fills in here. As you can see, it filled in "Python" for "Python is a popular blank language," but it did give that an incredibly low score.
[00:06:08]
So you would set thresholds, obviously, which again can just become a greater-than or less-than comparison in your code. For "the blank of gravity was discovered by Newton," it guessed law, theory, center, but again the scores get increasingly lower as they go on. That one, I mean, it's not a lot of text and a tiny model.
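That greater-than check really can be one line. A minimal sketch of thresholding the scores (the 0.5 cutoff and the bert-base-uncased checkpoint are assumptions for illustration, not values from the lesson):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Keep only guesses the model is reasonably confident about.
THRESHOLD = 0.5  # arbitrary cutoff for illustration

for guess in fill("The capital of France is [MASK].", top_k=3):
    if guess["score"] > THRESHOLD:
        print(f"{guess['score']:.4f}  {guess['token_str']}")
```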
[00:06:33]
I don't feel good about that one, but it doesn't feel particularly better about that one than the other ones either. So there's that. Part of this is, like I said before, the GPT models were not necessarily trained to look both ways. The BERT family of models were trained to look both ways.
[00:06:51]
And some of that "both ways" is in terms of the training data, and then also the architecture of the model. When we see the GPT models later, effectively one of the things they do shows up when you look at the score of a word and how closely it is related.
[00:07:07]
The GPT models intentionally just nerf every future word: they're all given a negative infinity score. And that's how you keep it from looking forward and only thinking about the previous words as you generate the next one. The BERT ones are set up to weight things so that you look on both sides and get the right answer.
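Here's that negative-infinity trick in isolation (a sketch in plain PyTorch, assumed for illustration; this is not the lesson's code): every position above the diagonal, i.e., every future token, gets -inf before the softmax, so its attention weight comes out as exactly zero.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores, one row per position

# True above the diagonal = "this position is in the future"
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Nerf the future: -inf scores become 0 after the softmax
scores = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(scores, dim=-1)

print(weights)  # each row only attends to itself and earlier positions
```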
[00:07:26]
And so that's where the art of it is: why some models are good at some things and some are not is that either the architecture of that model or the tuning of said model is what gets you there. Cool. Then we've got summarization. This one kind of does what you think it does.
[00:07:47]
You can set stuff like the max length and the min length and things along those lines, but it will take strings of text, it will take the parameters you set, and it will do the thing.
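A minimal sketch of that, using the pipeline's default summarization model as with the other demos (the sample text and the length values here are assumptions for illustration):

```python
from transformers import pipeline

# Default model for the summarization task.
summarize = pipeline("summarization")

text = (
    "Hugging Face pipelines wrap a model and a tokenizer behind a single call. "
    "The summarization task condenses a long passage into a shorter one, and the "
    "max_length and min_length parameters bound how long the summary can be. "
    "Like the other pipelines, it takes strings in and gives structured results back."
)

result = summarize(text, max_length=30, min_length=10)
print(result[0]["summary_text"])
```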