Lesson Description
The "Chain-of-Thought Prompts" Lesson is part of the full, Practical Prompt Engineering course featured in this preview video. Here's what you'd learn in this lesson:
Sabrina walks through a chain-of-thought (CoT) prompt example. Chain-of-thought prompting asks the model to show its reasoning step-by-step. This breaks complex problems down into intermediate steps and can be very effective when combined with the few-shot technique.
Transcript from the "Chain-of-Thought Prompts" Lesson
[00:00:00]
>> Sabrina Goldfarb: OK, this next section we're going to talk about is called chain-of-thought prompting, and I promise that if you don't already know about this, you are going to add this to 80% of your prompts moving forward because it's just so simple and it works so incredibly well. So let's first talk about what is chain-of-thought prompting. Chain-of-thought is asking the model to show its reasoning step-by-step.
[00:00:29]
It breaks complex problems into intermediate steps, right? The same exact way that we as humans, when we're thinking about things, talk through them step by step and break things down into smaller pieces, the model does the same thing. A zero-shot version of a chain-of-thought prompt is just "let's think step by step," and we're going to be talking about this five-word phrase a lot for the next few minutes.
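In practice, zero-shot chain-of-thought is nothing more than tacking that trigger phrase onto whatever prompt you already have. Here's a minimal Python sketch; the helper name is mine, and the apple question just echoes the one used later in the lesson:

```python
# Zero-shot chain-of-thought: append the trigger phrase to any existing prompt.
def make_zero_shot_cot(prompt: str) -> str:
    """Turn a plain zero-shot prompt into a zero-shot CoT prompt."""
    return f"{prompt}\n\nLet's think step by step."

question = (
    "John has 5 apples, 3 red and 2 green. Sally has 2 apples, both red. "
    "How many green apples are there in total?"
)
print(make_zero_shot_cot(question))
```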
[00:00:59]
You can also use few-shot chain-of-thought prompting, which means including reasoning steps in your examples. So maybe you're utilizing an LLM for some mathematical problems that it's just not getting right, you know, those old word problems from when we were in school as kids: if Sally has 5 apples and John has 2 apples, how many apples are red, right?
[00:01:26]
And you're like, what? But if you can provide those few-shot examples with chain-of-thought reasoning, accuracy improves with these models; a sketch follows below. But let's talk about how much better, and then we're also going to talk about the research behind it. So research found that chain-of-thought prompting truly unlocked accuracy on these kinds of tasks with LLMs. The study itself was called "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022).
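A few-shot chain-of-thought prompt, by contrast, bakes the reasoning steps into the examples themselves, so the model imitates that pattern on the new question. A minimal sketch; the exact wording of the worked example is my own illustration:

```python
# Few-shot chain-of-thought: the example answer spells out its reasoning,
# so the model reasons the same way before giving its final answer.
FEW_SHOT_COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: John has 5 apples, 3 red and 2 green. Sally has 2 apples, both red.
How many green apples are there in total?
A:"""

print(FEW_SHOT_COT_PROMPT)  # send this to a model; it should answer step by step
```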
[00:01:59]
And this is not the only study on this by any means, but it is one of the larger ones, and it's really interesting because, like I said, there's research that's been done on this phrase, "let's think step by step": these simple five words that we can add to any of our prompts, any standard prompt, any zero-shot prompt, any few-shot prompt. And this study looks specifically at the zero-shot chain-of-thought prompt, right?
[00:02:27]
So just adding "let's think step by step," without adding any examples, the study looked at performance on a diverse set of reasoning tasks, including arithmetic and logical reasoning, which are a couple of things these models tend to really struggle with. And we have to remember, these models are not calculators, OK? I just want to take two seconds to think about this. When you ask a model what 2 + 2 is, the model is not a calculator.
[00:03:00]
It's not doing the calculation. It is predicting the next most likely token. So for 2 + 2, the next most likely token is going to be 4. But when I give it those logic/arithmetic tasks, like I'm saying, if John has 5 apples that are red and green and Sally has 2 apples that are all red, how many green apples are there total, right? The models struggle with this because they're not able to just do math.
[00:03:31]
Some models can execute code, right? So sometimes, if you ever ask ChatGPT a math question, it'll show you a bunch of code in the background and then run it; ChatGPT can run Python and get you your answer. But these models are pattern predictors, token predictors. They're not calculators.
[00:03:57]
Something else that's really interesting about these models: let's say that I ask the model to flip a coin, right? Just heads or tails, flip a coin. You would think that flipping a coin is random, and that if done enough times, 50% of the time I'd get heads and 50% of the time I'd get tails. That is not the case with large language models. They have been trained on a specific set of data that will favor one or the other, and you are more likely to get one or the other.
[00:04:32]
If I ask these models to pick a random number between 1 and 10, you would think they would pick a random number, but they do not. Based on their training data, they're going to predict the most likely next token. That might be the number 7, perhaps because 7 came up most commonly in the training data for similar kinds of tasks. So you're not actually going to get a random number, a random roll of the dice, or a random heads-or-tails flip.
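You can check this skew yourself by asking the same "random" question repeatedly and tallying the answers. A rough sketch using the Anthropic Python SDK; it assumes the anthropic package is installed, ANTHROPIC_API_KEY is set in your environment, and the model name is a placeholder you'd swap for whatever model you have access to:

```python
# Tally a model's "random" numbers; the distribution is rarely uniform.
from collections import Counter

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
tally = Counter()
for _ in range(20):
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model name
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 10. "
                       "Reply with just the number.",
        }],
    )
    tally[msg.content[0].text.strip()] += 1

print(tally)  # expect a skewed count -- 7 is a frequent favorite
```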
[00:05:02]
And so we have to remember that. And it's the same thing when it comes to mathematical problems, right? That's why people are so confused. They're like, I use LLMs, or AI applications, or whatever you want to call them, every single day to write incredibly complex code. I work at GitHub, which has millions of developers using it, and these models help me write my code every single day with really respectable accuracy.
[00:05:34]
But for some reason, when I ask it how many green apples there are, it can't do it. Like that just doesn't seem to make sense, right? We have to also go back to something we talked about earlier, which is that these models only think while they are typing, right? Models aren't like us. Humans reason without talking, unless you're me, then I'm always talking out loud. No, just kidding. Today, at least I am, right?
[00:06:02]
But these models can't reason when they're not typing; when they're not generating text, nothing is happening. They're just taking in data, and then we can see them talking back at us, right? So every time I go into my Claude chat right here, we can see that there's nothing but the answer given to me, and that is the time the model is thinking. Now, like I said, I'm boiling this down to its simplest form, and I'm not going to explain the whole thinking process behind models, but essentially, models are only thinking when they're speaking to us.
[00:06:41]
So adding this "let's think step by step" to our prompts was a way to say: hey model, don't just give me the answer. Give me your entire thought process. Reason. Break this complex task, or even this simple task, this arithmetic task, down into very simple steps. So, really interestingly, we can look at some of the data. I will read you what the table says: a robustness study against templates, measured on the MultiArith dataset.
[00:07:15]
What is it talking about? Basically, we're given a template; we can see this first one is "let's think step by step," like I mentioned. This is our chain-of-thought template, and we can see they tried a ton of different chain-of-thought templates. So there are instructive ones: "Let's think step by step." Or "First," with a comma, which would prompt the model to continue, right? If I say "first," then you're going to expect me to say, first I did this, and then I did this, and next I did this, right?
[00:07:47]
"Let's think about this logically" was another one. "Let's solve this problem by splitting it into two steps," etc., etc. And then, interestingly, they also decided to try some misleading templates: "Don't think. Just feel." That's kind of how I write my code all the time anyway, right? "Let's think step by step but reach an incorrect answer." Or completely irrelevant ones, like "Abracadabra!" and "It's a beautiful day."
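For reference, here are the templates called out above, grouped the way the study groups them; I'm only listing the ones mentioned in this walkthrough, and the variable names are mine:

```python
# Trigger templates from the robustness study, as quoted in this lesson.
INSTRUCTIVE = [
    "Let's think step by step.",
    "First,",
    "Let's think about this logically.",
    "Let's solve this problem by splitting it into two steps.",
]
MISLEADING = [
    "Don't think. Just feel.",
    "Let's think step by step but reach an incorrect answer.",
]
IRRELEVANT = [
    "Abracadabra!",
    "It's a beautiful day.",
]
```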
[00:08:17]
And we can see the accuracy against a simple zero-shot prompt on the bottom. So this prompt on the bottom is a standard zero-shot prompt asking the same MultiArith questions, OK? For that zero-shot prompt, the accuracy was 17.7%. If we look at the accuracy after simply adding "let's think step by step" to the end of our prompt, it went up to 78.7%. I cannot think of five words in the English language that could possibly help more with your prompts than "let's think step by step."
[00:09:03]
I mean, that is absolutely incredible. If I could change the quality of the code I write by telling myself five words, I would, right? The nicest part about chain-of-thought prompting, the best part in my opinion, is that "let's think step by step" is versatile and task agnostic. I don't have to save it for any specific type of prompt. It doesn't matter if I'm writing code, generating architecture documents, or maybe writing documentation.
[00:09:38]
It doesn't matter if I'm doing something simple like learning; I can always use "let's think step by step." There's nothing it doesn't apply to, right? And you can see that these other templates worked well too. And something that's pretty interesting: even the misleading templates still got a small boost in accuracy. "Don't think. Just feel.": 18.8%. You might say, sure, that's only 1.1% better than the zero-shot.
[00:10:11]
But you would think a misleading template like "don't think, just feel" would have to be worse, right? But it's not. There's something about the fact that these models are now able to reason out loud to us that lets them work through these complex MultiArith questions, or logic-based questions, and answer better than they could before. So, like I said, if there's nothing else you leave here with today, it's this: please, if you are struggling to get the right answer, or an answer that you are happy with, add "let's think step by step," OK?
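To put the numbers quoted above side by side:

| Prompt template | MultiArith accuracy |
| --- | --- |
| Standard zero-shot (no trigger phrase) | 17.7% |
| "Don't think. Just feel." (misleading) | 18.8% |
| "Let's think step by step." (zero-shot CoT) | 78.7% |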
[00:10:54]
Now, it is important to understand that this zero-shot chain-of-thought does actually underperform few-shot chain-of-thought, where you have carefully crafted, task-specific examples along with "let's think step by step." But the enormous score gain over the zero-shot baseline is really important here, because, like we were talking about with few-shot prompting,
[00:11:23]
it is very important that your examples are diverse and of good quality, with good outputs. If you don't have time to craft really good outputs for your few-shot examples, you don't need to, according to this study, right? You've already jumped a significant amount in accuracy just by adding "let's think step by step." Now, of course, you could use any of these other templates, but since that one was the most researched and performed the best, I personally would use "let's think step by step."
[00:12:00]
I also want to look at this in terms of the size of the model, right? So: a model scale study with various types of models. We can see zero-shot prompting versus zero-shot chain-of-thought (CoT means chain-of-thought, if anyone gets confused by it). If we look at the size of the models, we can see that on the smaller models, zero-shot and zero-shot chain-of-thought are about the same in terms of their accuracy on MultiArith, right?
[00:12:34]
About the same. And as the models get larger, the zero-shot accuracy does go up, and it goes up pretty substantially; we can't take anything away from that. But the zero-shot chain-of-thought, with just this "let's think step by step," goes up enormously, far more than you would have imagined. So again, we are talking about a future where models are just going to get larger and larger, because they already are getting larger; they're being trained on more data.
[00:13:07]
They have more compute power, more capacity. We can assume these models will continue to get larger and larger still, and if that's the case, then in theory this zero-shot chain-of-thought will get better and better still. So if there's nothing else you leave here with today, it is "let's think step by step." OK, let's start a totally new chat and think of a couple of examples in our Claude chat.
[00:13:45]
And this is my favorite example of the day: Can penguins fly? Let's think step by step. Let's see what Claude has to say about penguins flying. "Let me think through this step-by-step," right? So again, Claude is only thinking while it's typing to me. So now we can see this whole chain of thought happening. It's literally as if we're looking at Claude's chain of thought, at its thought process.
[00:14:15]
One, what flying means: flying typically refers to the ability to achieve and sustain flight through the air using wings. Penguin anatomy: penguins do have wings, but they've evolved into flippers that are flat, stiff, and paddle-like, rather than the flexible, feathered wings of flying birds. Penguin evolution: penguins evolved from flying ancestors, but over millions of years they have adapted to an aquatic lifestyle.
[00:14:39]
Their bodies became optimized for swimming rather than flying. In the air: penguins cannot fly through the air. Their wings are too small relative to their body weight, and their bone structure is dense, not hollow like that of flying birds, making them too heavy for aerial flight. Pretty sure that's why I can't fly too. In the water: penguins are exceptional flyers underwater. They use their flipper-wings to propel themselves through the water in a motion very similar to flying, reaching speeds up to 22 mph, or 35 kilometers per hour.
[00:15:16]
So the answer: no, penguins cannot fly in the air, but they are superb swimmers who essentially fly through water using the same wing-flapping motion their ancestors used for aerial flight. Now, did Claude actually need to think about this to tell me whether penguins can fly? No, I don't think so. Maybe, but I really don't think so. But the point here is not whether Claude needed it to come to the right answer; it's the fact that now we're able to see a full chain of thought and how Claude comes to this conclusion.
[00:15:53]
So if we were to ask, hey, Claude, can penguins fly? And it said kind of, or maybe, or even yes, we could ask, why do you think that, right? Let's think step by step, or "First," you know, and kind of continue that conversation. And then it would reason its way to this: no, penguins cannot fly, at least in the air. They're more, you know, swimmer-flyers.
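If you'd rather reproduce this demo in code than in the Claude chat UI, here's a minimal sketch using the Anthropic Python SDK, under the same assumptions as before (SDK installed, API key in the environment, placeholder model name):

```python
# Reproduce the penguin demo: the trigger phrase makes the reasoning visible.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": "Can penguins fly? Let's think step by step.",
    }],
)
print(response.content[0].text)  # step-by-step reasoning, then the final answer
```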