AI Agents Fundamentals, v2

Web Search for Agents

Scott Moss
Netlify

Lesson Description

The "Web Search for Agents" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott covers web search for agents, showing how LLMs can access online information while managing context and grounding outputs in truth. He discusses using native tools, handling costs and limits, and balancing efficiency with model context constraints.

Preview

Transcript from the "Web Search for Agents" Lesson

[00:00:00]
>> Scott Moss: We're going to move on to web search and context management. Don't worry, we are not going to build a web search tool here, because that would take a very long time. Well, actually, I guess not. If you were willing to pay money, it wouldn't take that long, but we're not spending money on stuff, so we're going to learn about a different type of tool calling, other than the function calling we've been doing.

[00:00:22]
But first, let's talk about web search, and then context management, which I'll get to in a second. First thing is web search. It's exactly what it sounds like: you use Google, so what if your LLM could use Google? That's essentially it. But it's harder than that, really harder than that, because how do you use Google? I don't think anyone here uses Google through an API, and if you do, why?

[00:00:51]
And second of all, how? We mostly use it through a browser, so when I'm talking about web search here, I'm not talking about letting an agent use a browser, which they can, but that's not what this is. There are tools that let an agent drive a browser, and they're really not that hard to use. There's an MCP tool that uses Playwright, there's Browserbase, there are so many things.

[00:01:12]
Heck, this course was originally going to be about building a browser-driving agent, right? It was just too complicated. But that's not what I'm talking about. I'm specifically talking about programmatically letting your LLM hit some API and get results from the web. You might hear that referred to as grounding, as in making sure your LLM has accurate, up-to-date information before it decides anything, so its answers are grounded in truth.

[00:01:37]
It's real, up-to-date truth, so you'll literally hear it referred to as grounding a lot. And there are tons of different providers that offer this as a service. You might have heard of Perplexity. There's this other big one called Google, I'm sure you've heard of that, they also offer something similar. And there are just so many: Firecrawl, Exa. And then the model providers themselves have their own, and that's what we're going to use today.

[00:02:01]
We're going to use OpenAI's built-in web search tool. This is what I call a native tool, different from function calling. Same concept, it's just that the code isn't running on our computer. With function calling, as you saw and just wrote, the execute function is on our machine. It's running wherever our machine is, in this case our computer. It could be on our server, but it's a function that we wrote.

[00:02:23]
Even if we didn't write the function and we npm installed it, it's still running on our machine, okay? That's function calling. A native tool is a tool that runs on the provider you're talking to. In this case, we're talking to OpenAI, so the native tool we're going to use, the web search tool, is not running on our infrastructure, it's running on theirs. Okay, so they make the call, they have their own function, and then they put the context back into the agent, right?

[00:02:53]
Or rather, we still have to put the context back into the agent, but the compute is on their side and they return the results to us. It's essentially just making an API call: the tool itself is an API call on the provider's server, we get the results, and we put them back into the agent. But it runs on the provider's side. They provide that. So OpenAI has a set of tools that they provide: web search, and code execution, where you can give it some arbitrary JavaScript or Python string and it will execute that for you and give you the results, right?
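
Just to make that distinction concrete, here's a minimal sketch of opting into a native tool, assuming the OpenAI Node SDK's Responses API and its web_search_preview tool type. The model name and prompt are purely illustrative, and the exact tool name can differ by API version, so check the current docs before copying this.

```ts
import OpenAI from "openai";

const client = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

const response = await client.responses.create({
  model: "gpt-4o", // illustrative; has to be a model that supports the native tool
  // We only declare that the model may use web search. The actual search
  // runs on OpenAI's infrastructure, not on our machine.
  tools: [{ type: "web_search_preview" }],
  input: "What did OpenAI ship this week? Cite your sources.",
});

// The provider already folded the search results back into the model's context,
// so we just read the grounded answer.
console.log(response.output_text);
```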

[00:03:25]
They have all that type of stuff, so you don't have to build it. And I talked about some of the other ones, like Perplexity, Google Gemini, all these different things. Because it runs on their server, you're obviously going to be subject to cost, rate limiting, and speed, and you get less control over what the results look like, and we'll talk about some of the issues that come from that. And it only works for specific models and providers, right?

[00:03:52]
So if you want to use a native tool and you're not using OpenAI's models, you can't use that tool. You have to use their model for that tool, so you've got to hope the provider you're on has the tool you need. So, yeah, you're kind of locked in. You can't customize search sources or anything like that. They have their own algorithm that does the searching. You just call it with a query and that's it.

[00:04:14]
You don't do anything else, right? Then there's tool-based web search, where we implement search as a tool that the agent can call, very similar to what we just did with the file stuff. We can implement it from scratch, or we can use one of these API providers, like Exa. This was probably one of the first ones. It's pretty good. It's literally Google for AI: you can crawl websites, you can get answers, you can research.

[00:04:46]
They have something called Websets, I have no idea what that is, but the whole thing is like a search engine for AI. It costs money, of course, but it's pretty good. In fact, the Vercel AI SDK docs have a great page specifically on web search and all the different options. The OpenAI one is the one we're going to use. They have examples of how to use Perplexity with the SDK, and how to use Gemini's search with a Gemini model.
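
As a point of contrast with the native tool above, here's a rough sketch of tool-based web search: wrapping Exa's API in a tool the agent can call, similar in shape to the file tools from earlier. It assumes the AI SDK's tool() helper and the exa-js client; treat the property names (inputSchema vs. parameters varies across AI SDK versions) and the searchAndContents options as assumptions to verify against the docs.

```ts
import { tool } from "ai";
import { z } from "zod";
import Exa from "exa-js";

const exa = new Exa(process.env.EXA_API_KEY);

// Tool-based web search: this execute function runs on OUR machine.
// We call Exa's API ourselves and hand the trimmed results back to the agent.
export const webSearch = tool({
  description: "Search the web for up-to-date information on a topic",
  inputSchema: z.object({
    query: z.string().describe("The search query"),
  }),
  execute: async ({ query }) => {
    const { results } = await exa.searchAndContents(query, {
      numResults: 3,
      text: true,
    });
    // Keep only what the agent needs, and truncate page text so one
    // search doesn't blow up the context window.
    return results.map((r) => ({
      title: r.title,
      url: r.url,
      snippet: r.text?.slice(0, 1000),
    }));
  },
});
```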

[00:05:17]
And then there are tools you can just install from providers that offer this as a service. So here's Exa, right, you can use that. There's something called Parallel Web, which I just discovered. I want to try it. I haven't tried it out yet, but it looks really cool. I like the design, it's pretty clean, and they're boasting, where is it at? They've got something on here about how good they are.

[00:05:44]
Yeah, right here. Look, they're over here, and everybody else is over here. That means they're good, so I want to try that. The point I'm trying to make is that a lot of people are trying to solve this. And I'm also trying to dissuade you from making your own. Please do not make your own web search tool, just don't. As an exercise, yes, go do it. It's fun, it's great. But then stop.

[00:06:06]
Don't go past that exercise. Use something else. It's just so much better to use something else, okay? That's the point I'm trying to make here. And again, we're going to use the OpenAI one. It's pretty simple, but we need to understand how it works and how greedy it might be. I don't know what descriptions or prompts it's giving our LLM about how eager it should be to use the tool, unless I do some tracing.

[00:06:40]
I'm guessing OpenAI is like, you should use this as many times as you want, Mr. or Mrs. LLM, don't be afraid to use it. And that's why, as we'll probably soon find out, it will call the web search tool many, many times, even for the simplest query. So I don't know what they're putting in the description, but I'm sure it's to make sure you pay them more money.

[00:07:06]
But the point is, this is a really good example of where you might see parallel tool calls for web search, because it's the equivalent of Googling something yourself. If you were really good at Googling, with crazy Google skills, you might open up multiple tabs and run a search in each one, just to have different branches of Googling going.

[00:07:23]
It's the same thing. The agent might be like, all right, let me open up multiple tabs: I'm going to start a search over here for this term, I'm going to start this search over there, I'm going to do this. They're all running in parallel. I'll wait for them to come back and see which one's better. Okay, cool, this one sucks, we're going to dive into this one. That's a strategy you might see with something like web search, and it leans closer to what you might have heard of as deep research.
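
Here's a tiny sketch of that "multiple tabs" idea in code: just fanning a few queries out with Promise.all. The search function is injected, so it could be the Exa tool from earlier or anything else; the commented-out queries are made up for illustration.

```ts
// Any web search function fits this shape, including the tool's execute from earlier.
type SearchFn = (query: string) => Promise<unknown>;

// Fire several related searches at once ("open multiple tabs"), wait for all of
// them, and hand the branches back so the agent can pick which one to dig into.
async function parallelSearch(searchWeb: SearchFn, queries: string[]) {
  return Promise.all(
    queries.map(async (query) => ({
      query,
      results: await searchWeb(query),
    })),
  );
}

// Hypothetical usage: three angles on the same question, searched in parallel.
// const branches = await parallelSearch(searchWeb, [
//   "OpenAI native web search tool",
//   "Exa web search API",
//   "web search rate limits for LLM agents",
// ]);
```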

[00:07:50]
It's like a deep research tool, which is very similar to that. So if you've never used deep research on like ChatGPT or Gemini, it's like a little thing you can toggle on. Highly recommend using deep research to go research stuff. It's like a set of tools like web search and stuff that can take like 10 minutes to run. It takes a long time, but it comes back with a full report of everything with annotated sources and all that stuff.

[00:08:13]
It's really cool. I use it for everything. This would be a precursor to making your own deep research tool, which I highly, highly recommend doing as an exercise, on any index, whether it's the web, a GitHub repo, or, I don't know, a database. The index doesn't matter. Being able to deeply research something to find the truth and cite those sources is a really good tool to have in your bag, in my opinion.

[00:08:51]
Now, the issue with something like searching the web is that it's going to lead to a context window problem. Let's compare all models. Oh my God, they don't want to show it, oh, here we go, I know they show you the tokens. So, for instance, GPT-4.5, let's do 4.5.1, and, I don't know, o3 Pro. Cool. If we go down here to context, we have something called a window. It says 400,000 on these two, 300,000 on that.

[00:09:25]
That window is everything that's in the conversation array: how many tokens you can have in that array, whether it's input or output, doesn't matter, that's how many tokens you can have total. As for what a token is, you'll see when we do this exercise that it depends on the model. Each model calculates tokens differently, but on average you can say a token is roughly every 3.5 characters or so, right?

[00:09:52]
Somewhere around that. So, 400,000 tokens in a window. Once you get past that, the model can no longer do inference. The token limit is how much it can carry, and it directly affects inference, because all of those tokens have to be kept in memory while it's trying to figure out what the next token to generate is. That's how attention, the mechanism inside the transformer (the T in ChatGPT), calculates what comes next.

[00:10:18]
It considers all the words before it and all the words after it. So if it's trying to calculate the next token and it has to look this way and that way, the more you have in both directions, the longer it's going to take, right? And you're eating up more RAM, which, I'm not sure if you saw, there's a RAM catastrophe in America right now, so RAM prices are insane. It's literally gold. So that token window is precious, and some models have million-token context windows, like all the Gemini models are a million.
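
For reference, the attention mechanism being described is usually written as scaled dot-product attention, and its cost is why long windows hurt: every token attends to every other token in the window, so the attention computation grows roughly quadratically with context length, and the key/value cache you have to keep in memory grows with it.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\qquad \text{compute} \approx O(n^2) \text{ for a window of } n \text{ tokens.}
```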

[00:10:51]
I've seen open source ones that are in the millions, and you might think that's better. It's not exactly true, but sure, more context is helpful. And then you have the window split up into inputs and outputs. I can see they don't show inputs here, but you can do the math and figure out what the inputs are. So max output tokens, these are all the tokens that the LLM will generate in one response, right?

[00:11:18]
It won't generate more than this, so this one will literally do 128,000 tokens of output, and that one will do 100K. And then you have input tokens, which is the difference between the 400K and the 128K; that's how many input tokens you can send on one turn. That's all it can handle. Should you always do that? No. There are issues with nearing that limit, you don't always want to get close to it, and here I'll talk more about input tokens and output tokens.
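
To make that arithmetic concrete, here's a rough sketch of budgeting against those numbers, using the roughly 3.5-4 characters per token rule of thumb from above. The constants mirror the comparison page being shown; real counts are model-specific, so use the model's own tokenizer (e.g. tiktoken) when precision matters.

```ts
// Illustrative numbers from the model comparison page above.
const CONTEXT_WINDOW = 400_000;    // total tokens the window can hold (input + output)
const MAX_OUTPUT_TOKENS = 128_000; // the most the model will generate in one response
const INPUT_BUDGET = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS; // what's left for input

// Very rough estimate: ~3.5-4 characters per token, varies by model and tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Check whether a conversation still fits in the input side of the window.
function fitsInputBudget(messages: { content: string }[]): boolean {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return used <= INPUT_BUDGET;
}
```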

[00:11:50]
Oh, I already put them in here. Oh man, I'm so stupid. Let's talk a little bit about why they're limited. This is the attention mechanism I talked about, inside the transformer, the T in ChatGPT. You can look at the math there. But there are also consequences just for approaching that limit, right? Even though you might think, oh, I'm not at 400K yet, I'm only at 360,000, I bet you'll notice the performance and the accuracy of your agent getting worse.

[00:12:12]
There are so many issues as you get closer to that limit. One of them is called lost in the middle: if some information sits in the middle of that context, the LLM kind of forgets it. It remembers things at the beginning, it definitely remembers things at the end, the most recent things, but somewhere in the middle it kind of forgets. Again, if you think about how the attention algorithm works, when I'm trying to figure out what the next thing is, I'm going in both directions, considering something this way and considering something that way.

[00:12:51]
If I'm in the middle, then there's a lot to consider in both directions, right? It's just a lot of information, so it's hard to weigh. It's also about influence. If I'm looking at all the context history, the most recent tokens have been influenced less by other tokens because not much comes after them yet, and the tokens at the beginning are less influenced by other tokens because there's nothing before them.

[00:13:15]
So those tokens stand out more, whereas the tokens in the middle are heavily influenced by the tokens before and after them, and it's harder for the model to remember or consider those. Position matters. That's why having a million-token limit is like, oh, that's cool, but the accuracy up near it is so bad that I don't even want to approach it, so what's the point? That's one thing.

[00:13:40]
The other thing is speed and cost, right? Sending all those tokens across the wire, if that's the architecture you have, makes requests slow. Everything just gets slower, the inference gets slower. The model that performed at 99% isn't performing at 99% anymore, because the conversation is getting out of hand and you're making it work harder. So your goal is to keep the context window as efficient as possible.

Only let in what you need to let in, and then find ways to keep out the things that don't matter. But that's the science of it: how do you know what matters and what doesn't when it's all semantics, not logic? It's not something you can just write code for.
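
There's no one right answer to that, but as a starting point, here's a naive sketch of one common strategy: keep the system prompt, keep the most recent turns, and drop the oldest ones once you blow past a token budget. The message shape and the character-based token estimate are simplifications, not the course's actual implementation.

```ts
type Message = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
};

// Same rough heuristic as before: ~4 characters per token.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Naive context management: assume messages[0] is the system prompt, walk
// backwards from the newest message, and keep turns until the budget is spent.
function trimToBudget(messages: Message[], budgetTokens: number): Message[] {
  const [system, ...rest] = messages;
  const kept: Message[] = [];
  let used = estimateTokens(system.content);

  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (used + cost > budgetTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }

  // The oldest non-system messages fall off first; the most recent turns survive.
  return [system, ...kept];
}
```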
