Lesson Description

The "Transformers Overview" Lesson is part of the full, Open Source AI with Python & Hugging Face course featured in this preview video. Here's what you'd learn in this lesson:

Steve explains how transformers work, including the core components: embeddings, transformer blocks, and output probabilities. He also discusses how self-attention allows tokens to consider surrounding words to refine their meaning in context, such as distinguishing between "river bank" and "savings bank."

Transcript from the "Transformers Overview" Lesson

[00:00:00]
>> Steve Kinney: So once we have our tokens and the IDs related to those tokens, we get into this whole Transformer thing, right? And like I said, in the pattern that we've been seeing so far, we'll do the conceptual stuff and then we'll pull it into a notebook and we'll kind of see and play around with it, and also see new and fun ways to break stuff that I didn't think about until I had a bunch of friends in a room challenging me with other ways to think about it, which will now live rent free in my head.

[00:00:32]
So there will be some conceptual stuff and then we will see it in practice. My sense was that kinda doing the two passes on it was the right choice. That's stock photography, not AI art, I would like the record to show. But there is everyone's requisite picture of a Transformer.

[00:00:50]
So now we can all stop thinking about Decepticons and Autobots and move on. So Transformer is effectively the popular approach that has kind of kicked off a lot of what we see these days: a 2017 paper, apparently, "Attention Is All You Need," that came out of Google. And it is the one powering all these text generative models (not the extractive ones), like OpenAI's GPT and Meta's Llama and Google's Gemini and Anthropic's Claude, so on and so forth, right?

[00:01:28]
It's funny because I think we think about originally OpenAI and GPT, even though the paper came out of Google originally. That's just how stuff works. And they are good at understanding these long range dependencies within sentences, right? Again, not just thinking about the word before and after it.

[00:01:44]
Because as we know as people who speak English, a lot of times something earlier in the sentence means something later in the sentence. But as we saw before, what these models do is they don't look forward, they're only looking backwards. They're only looking at the previous tokens to guess what the next token should be.

[00:02:11]
That is core, versus something like BERT, which is looking both ways, as we saw earlier with those fill-mask examples and all of those fun things. The one that is currently in the popular imagination is only looking backwards. And we can actually see that in a graph in a little bit.
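The "only looking backwards" versus "looking both ways" distinction can be pictured as a visibility mask. This is a minimal sketch, not code from the lesson: the function names and the three-word example are assumptions chosen for illustration.

```python
def causal_mask(n):
    """GPT-style: row i marks the positions token i may look at (0..i only)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every token may look at every position."""
    return [[True] * n for _ in range(n)]


tokens = ["the", "river", "bank"]  # toy example, not from the lesson
for tok, row in zip(tokens, causal_mask(len(tokens))):
    visible = [t for t, ok in zip(tokens, row) if ok]
    print(f"{tok!r} can see {visible}")
```

With the causal mask, the last token sees the whole sentence so far while the first token sees only itself, which is what lets the model guess the next token from the previous ones.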

[00:02:26]
And so the whole idea is there are three key components. We've got this idea of an embedding, a transformer block, and the output probabilities. The output probabilities we've kind of already talked about, there'll be a slide for it. But they are basically like, hey, what do we think the next token is going to be? What is most likely the next token given the previous ones?
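As a rough illustration of the output probabilities step, here is a softmax over made-up scores. The vocabulary and the score values are invented for this sketch; a real model produces one score per entry in a vocabulary of many thousands of tokens.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["bank", "river", "money", "the"]  # hypothetical tiny vocabulary
logits = [2.0, 1.0, 0.5, -1.0]             # made-up scores for the next token

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy pick: the most likely token

for word, p in zip(vocab, probs):
    print(f"{word:>6}: {p:.3f}")
print("most likely next token:", next_token)
```

Picking the single most likely token is only one option; text generators often sample from these probabilities instead.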

[00:02:50]
And the embeddings are kind of what we talked about, where we turn it into those vectors. So we kinda know the beginning, we kinda know the end. The only kind of murky part is this middle in the center. So again, embedding is breaking the words apart and turning them into numbers.
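A minimal sketch of the embedding step, assuming a three-word vocabulary and random four-dimensional vectors. In a trained model the table values are learned, and `make_embedding_table` and `embed` are hypothetical helper names.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One vector per token ID. Random here; learned during training in a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Embedding is just a lookup: token ID -> its row in the table."""
    return [table[i] for i in token_ids]


vocab = {"the": 0, "river": 1, "bank": 2}  # hypothetical token IDs
table = make_embedding_table(len(vocab), dim=4)

token_ids = [vocab[w] for w in ["the", "river", "bank"]]
vectors = embed(token_ids, table)

for tid, vec in zip(token_ids, vectors):
    print(tid, [round(x, 2) for x in vec])
```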

[00:03:04]
As we kind of saw, there's some positional information in there as well, in order to capture the order of the words. And so we've got the semantic meaning and the positional information. So then we get this attention mechanism, right? And it's basically allowing any given token to communicate with, or at least be aware of,

[00:03:27]
how about that, the other tokens, to capture that contextual meaning. Again, the difference between the banks of the old Raritan versus let's go rob some banks tomorrow, right? Like the other words in the sentence add a lot of context to what the word banks means in that given sentence, right?

[00:03:49]
And so what happens is we can kind of like look at each one of those and begin to map them together based on the other words. Kind of almost move the meaning, right? Like in a lot of these vector databases, it's like each one of those tokens' numbers is in relation to all the rest of them.

[00:04:04]
And that's kind of what we mean when we say a multidimensional space. Good luck picturing that. But it is the ability to kinda like, where is it in relation to the other words? Did they say thieves? Okay, I think I know the context of that word now. Yes, I get it.

[00:04:21]
We'll see some stuff in a second. All right, I'm gonna take another run at that. As you can tell even from the slides, I was like, that's gonna be confusing. So self-attention is like a group discussion. Each token can look at or talk to every other token to figure out what version of itself it means.
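The "group discussion" can be sketched as a table of attention weights: for each token, how much it listens to every token in the sentence. This assumes scaled dot-product scoring and uses the raw vectors directly, whereas a real transformer first passes them through learned query and key projections. The two-dimensional vectors are made up.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(vectors):
    """Row i: how much token i attends to each token, itself included."""
    d = len(vectors[0])
    rows = []
    for query in vectors:
        scores = [dot(query, key) / math.sqrt(d) for key in vectors]
        rows.append(softmax(scores))
    return rows


tokens = ["thieves", "robbed", "bank"]           # toy sentence
vectors = [[1.0, 0.2], [0.9, 0.1], [0.5, 0.5]]   # made-up 2-D embeddings

weights = attention_weights(vectors)
for tok, row in zip(tokens, weights):
    print(f"{tok:>8}: {[round(w, 2) for w in row]}")
```

Each row sums to 1, so it reads as how a token splits its attention across the sentence.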

[00:04:39]
And then there's a time of quiet, individual thinking time, right? After all the tokens have shared all their information and played their weird guessing game with each other about what they all mean, each one can kind of look at itself and adjust its meaning based on what it already knows, right?
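The "individual thinking time" corresponds to the feed-forward part of a transformer block, which processes each token's vector on its own. This is a minimal sketch with hand-picked weights (`W1`, `B1`, `W2`, `B2` are invented for illustration; real ones are learned and far larger), and it leaves out the residual connections and normalization that real blocks include.

```python
def linear(x, weights, bias):
    """One output per weight row: dot(row, x) + bias."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Expand, apply a non-linearity, project back. Uses only this token's vector."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hand-picked weights for illustration only.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
B2 = [0.0, 0.0]

token_vectors = [[1.0, 2.0], [-1.0, 0.5]]
outputs = [feed_forward(v, W1, B1, W2, B2) for v in token_vectors]
print(outputs)  # [[3.0, 0.0], [0.0, 0.5]]
```

Unlike attention, nothing here looks at neighboring tokens: the same function is applied to each position independently.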

[00:04:53]
So it's like, okay, I'm in a sentence. The other words in this sentence are burglar, hamburglar, [LAUGH] like whatever. I think I know which version of rob we're talking about. Again, the banks of the Hudson River versus the banks that take deposits: it'll look at basically the meaning and the relationships of those other words to figure out which variation of itself it might be.

[00:05:23]
Like I said before, that output is where it comes out. You need to know this on the basis of a conceptual understanding. Like I said before, there's not necessarily a quiz, right? But know that they look at the tokens around them and get a sense of what those words mean, to figure out where in that space their meaning sits and which version they are, cuz that sentiment analysis broke, right?

[00:05:44]
The core piece is like, how do you not be like the sentiment analysis that we saw in the very beginning, right? You take into consideration the other words around you and what they might mean. We even saw that when we looked at the text generation early on, where we saw that if you started out in a very formal academic string, you ended up with a bunch of other words that were formal, academic.

[00:06:08]
In fact, we even saw, I think there was one text generation example where it just said a long time ago in a galaxy far away or something like that. And that was enough to place those words, which could mean a lot of different things. Like, your brain jumped to the same thing my brain did, right?

[00:06:28]
But the computer, like, it was like Emperor with a capital E. Mm-hm. Yeah, right, cuz again, it looks at those words, and every individual token looks at its relationship to the other tokens and what words kind of come around it, and they begin to like shift themselves into position to do effectively the same thing that our brains do, right?

[00:06:52]
And that context is the important part. Bank can look at river, can look at Hudson, to know that it means river bank. Bank can look at robbed and thieves to see that it probably means the Minneapolis Savings Bank. I don't know if that's a real bank, but whatever. And that context is where it can begin to get the sense and move itself into the right position for the related context.

[00:07:19]
And that's what separates it from those very simple sentiment analysis models and all those other things. So yeah, for the bank sentences, that vector gets shifted towards the financial meaning for the first one, and for the second one, it gets shifted towards the river-edge meaning, right? And like we start to get a sense of the meaning of the entire phrase.
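The shift of the "bank" vector can be shown on a toy scale. In this sketch the two dimensions are assumed to mean "finance-ness" and "river-ness" (real embedding dimensions are not this interpretable), all vectors are invented, and `contextualize` is a hypothetical helper that mixes vectors by attention weight with no learned projections.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(target, sentence):
    """Blend the sentence's vectors into a new vector for `target`, weighted by attention."""
    d = len(target)
    scores = [sum(t * k for t, k in zip(target, vec)) / math.sqrt(d) for vec in sentence]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, sentence)) for i in range(d)]


# Toy axes: [finance-ness, river-ness]. All values are made up.
bank = [1.0, 1.0]     # ambiguous on its own
robbed = [2.0, 0.0]
river = [0.0, 2.0]

money_bank = contextualize(bank, [robbed, bank])  # "robbed ... bank"
river_bank = contextualize(bank, [river, bank])   # "river ... bank"

print("after 'robbed':", money_bank)  # [1.5, 0.5], leans financial
print("after 'river': ", river_bank)  # [0.5, 1.5], leans river
```

The same starting vector ends up in two different places depending on its neighbors, which is the shift described above.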

[00:07:38]
Cause each word is kind of shifting itself, in that second introspection time, towards different meanings. Now all of a sudden we have a better sense of what the whole thing means, right, through a lot of like math, and that's why your GPU and your fans start spinning on your computer and stuff like that.

[00:07:54]
It is obviously a lot more complicated than that, but conceptually that's what we're going for here. Like I said before, then we try to guess based on now we have some sense of the meaning, not just the words. We have some sense of the meaning that informs what the next word is going to be.

[00:08:13]
And then ideally we should be able to guess it as we have all seen, sometimes.
