Lesson Description
The "Code Execution Tool" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott discusses code execution as a tool like shell commands but avoids implementing it due to safety and complexity. He emphasizes building higher-level tools to streamline workflows and improve AGI functionality.
Transcript from the "Code Execution Tool" Lesson
[00:00:00]
>> Scott Moss: So I do have this other tool that's kind of similar to the shell command. It's called code execution, and I've been going back and forth on whether or not we should implement it, and I think it's worth talking about. I don't think we're gonna implement it for the course, but I will leave it as an exercise for you all to try and implement, and I'll tell you why I don't want to implement it for the course.
[00:00:22]
Essentially, proper code execution would be what I showed you earlier using a sandbox, you know, isolating things away so code isn't being executed on your personal machine with sensitive information. I didn't set any of that up, and I didn't want to set any of that up. I didn't want to have to sign up for another account. I didn't want to have to get into sandboxing and all that stuff. So here's what I realized: if you think about it, we already have all the tools we need to execute code.
[00:00:50]
We only need two tools. We need to be able to read and write to files (technically we only need to write a file, with read just to verify), and we need access to a terminal to execute the file. That's it. And then we can delete the file when we're done, because it's a temporary file. So we could have the LLM generate some code that it wants to execute, put it in a file, save that file, and use the terminal to execute the file.
[00:01:16]
Once that file is executed and it's got the results that it wanted, it can then delete the file. That's a very primitive code execution. There are two ways you could do that. One way is to not make a code execution tool at all and just ask the agent to do it right now. It already has terminal access, it already has read and write file capabilities, so we could just do it right now. The problem with that is that it's very much a workflow I'd want it to follow every single time, so I would have to specifically tell it: this is how you execute code.
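To make that concrete, here's a minimal sketch of that write, run, delete sequence in plain Node. The helper name runSnippet is hypothetical, and it assumes node is on your PATH:

```ts
import { writeFile, unlink } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { join } from "node:path";

const exec = promisify(execFile);

// Hypothetical helper: save the LLM-generated code to a temp file,
// execute it, return the output, and always clean up the file.
async function runSnippet(code: string): Promise<string> {
  const file = join(tmpdir(), `agent-snippet-${Date.now()}.js`);
  await writeFile(file, code, "utf8");             // 1. write the generated code
  try {
    const { stdout } = await exec("node", [file]); // 2. execute it in the terminal
    return stdout;                                 // 3. results go back to the LLM
  } finally {
    await unlink(file);                            // 4. delete the temp file
  }
}
```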
[00:01:50]
You know, I can put that in a system prompt, I can put that somewhere else, because it's a set of tools that have to be run in a certain order every single time I want it to do one thing. And what if I want it to do more than one thing? Then it gets really complicated. And that's because the tools are very atomic. We have read a file, write a file, delete a file. That's not the best way to make tools.
[00:02:13]
That's what I showed you all today because we're building an agent that has very atomic tools, but if you're going to build something a little more efficient, you would also have what I would call higher-level, higher-leverage tools. These are tools that encapsulate all of that into just one tool, right? So for instance, instead of expecting the agent to be like, hey, you know you can execute code if you just use the write file tool, run that tool, do another loop, then use the run command to do node with the name of that file, get the results of it, do another loop, and then you can respond to me.
[00:02:51]
Instead of explaining all that, I could just make one tool that does it, right? Why have the agent waste cycles on what is essentially a workflow that I want to be the same every single time? So in this case, this tool is like a workflow. So if you look at this execute code tool, it has a lot of stuff. It has the code that you want to execute, so the LLM can pass in the code; the language that it's in (JavaScript, Python, or TypeScript); and then there's an execute function that does exactly what I just described.
[00:03:28]
It will essentially create a temporary file and write to that file. Then, depending on the language you gave it, it'll run the right binary. So if it's Python, it'll run python3, if you have that installed; if it's Node, it'll run node; if it's TypeScript, it'll run tsx. It'll do that, look at the results, feed those results back to the LLM, and then delete the file, right?
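Here's a rough sketch of what a tool like that might look like. The tool-definition shape (name, description, execute) is an assumption, so adapt it to whatever agent framework you're using; it also assumes node, tsx, and python3 are installed and on your PATH:

```ts
import { writeFile, unlink } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { join } from "node:path";

const exec = promisify(execFile);

// Each supported language maps to a binary and a file extension.
const runners = {
  javascript: { bin: "node", ext: ".js" },
  typescript: { bin: "tsx", ext: ".ts" },
  python: { bin: "python3", ext: ".py" },
} as const;

export const executeCode = {
  name: "execute_code",
  description:
    "Write the given code to a temp file, run it, return the output, then delete the file.",
  async execute({ code, language }: { code: string; language: keyof typeof runners }) {
    const { bin, ext } = runners[language];
    const file = join(tmpdir(), `exec-${Date.now()}${ext}`);
    await writeFile(file, code, "utf8");
    try {
      const { stdout, stderr } = await exec(bin, [file]);
      return { stdout, stderr };
    } finally {
      await unlink(file); // temp file is deleted whether the run succeeded or not
    }
  },
};
```

Note that this still executes arbitrary code directly on your machine, which is exactly the safety concern that makes this an exercise rather than part of the course.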
[00:03:50]
It's all just in one tool. You can do whatever you want. You can put another LLM in this tool, you can put another agent in this tool. There's no end to what you can do in the tool. You're only bound by your transport mechanism. Like, if you're calling an agent over HTTP and your tool takes 3 minutes, yeah, you're gonna get a timeout obviously, so you need to figure out a better transport for your agent.
[00:04:12]
Maybe in this case WebSockets would be better, or streaming over SSE might be better. So you're only limited by your transport mechanism; you're not limited by the LLM in any way. It doesn't know, it doesn't care. So this tool could be whatever you want it to be. It's a workflow, it's not an atomic tool that does one thing. So the reason I don't want us to cover it is that it's pretty dangerous. One, I don't want anybody to mess something up on their computer, and two, there are a lot of variables going on with these binaries, and I don't know what people have installed on their computers, so this stuff may or may not work.
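Circling back to the transport point, here's a minimal sketch of streaming agent events over SSE with Node's built-in HTTP server, so a long-running tool never has to fit inside a single request timeout. runAgent is a hypothetical stand-in for your real agent loop:

```ts
import { createServer } from "node:http";

// Hypothetical stand-in for your agent loop, yielding progress events.
async function* runAgent(prompt: string) {
  yield { step: "tool_call", detail: "execute_code" };
  yield { step: "done", detail: `answer for: ${prompt}` };
}

createServer(async (req, res) => {
  // SSE headers: keep the connection open and push events as they happen.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  for await (const event of runAgent(req.url ?? "")) {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  }
  res.end();
}).listen(3000);
```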
[00:04:48]
So I didn't want to do it just to be like, oh, this doesn't even work on your machine. Some people have Windows, some people don't; some people have Mac, some people don't. There are too many variables that I'm not accounting for, so I left it here as an exercise. I don't want us to do it and get to the point where it's like, oh yeah, it doesn't work for you, sorry, move on. So it's here if you want to check it out, but I wanted you to understand it because it's a higher-level tool.
[00:05:13]
And this brings me to my point about MCP. The reason MCP could actually be better than it is, is that a lot of people just went, all right, I'm gonna take every API route or every SDK method on every SDK I want, and I'm gonna make each one of those a tool. Very atomic. OK, that's cool, but now you're making the LLM work way harder, right? Like, for instance, the Notion MCP server.
[00:05:38]
I hate it. Can't stand the Notion MCP server, because it's not obvious how to write to a Notion document by looking at its tools; they made them so atomic. It's like update block or find the block, and the LLM has no idea what a block means in this context. It's just trying to write to a file. Notion would have been better off making a tool like update file that did all the block stuff internally and asked for the right parameters.
[00:06:06]
Why does the LLM need to know about Notion's concept of blocks? And, oh sorry, you gotta call this other one first to get a block ID, did you know that? Oh, you didn't know that? It's too low level. No one's ever gonna get use out of that MCP server without being super aggressive about telling their LLM how to do it perfectly every single time. Notion should have instead abstracted all those tools into higher-level tools: write file, update file, delete file, search for files. Not 30 tiny tools that each do one thing and only make sense if you've read their API documentation very closely and understood their data model and their database schema. It's just too much.
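To illustrate the contrast, here's a sketch of the two tool surfaces. These tool names are hypothetical, not the actual Notion MCP server's tools, just an illustration of the idea:

```ts
// Low-level: the LLM has to understand the block model to chain these correctly.
const atomicTools = [
  { name: "search_pages", description: "Search pages by title; returns page IDs" },
  { name: "get_block_children", description: "List the child block IDs of a page or block" },
  { name: "append_block", description: "Append a block object to a parent block ID" },
  { name: "update_block", description: "Update a block's content by block ID" },
];

// High-level: one tool that encapsulates the search -> resolve -> append dance.
// Its execute() would look up the page, resolve block IDs, and translate the
// markdown into blocks internally; the LLM never sees any of that.
const highLevelTool = {
  name: "update_file",
  description: "Append markdown content to the document with the given title",
  parameters: {
    title: { type: "string", description: "Document title to update" },
    content: { type: "string", description: "Markdown to append" },
  },
};
```

The high-level version trades flexibility for reliability: the LLM only has to fill in a title and some markdown, not reverse-engineer a data model.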
[00:06:54]
So if a PhD-performing model cannot read your MCP server descriptions and understand how to write to a file, you did something wrong, I promise you. So go for higher-level tools, and break them down into more atomic ones as you move forward, not the other way around. Don't start low level and go up; start high level. Each tool is a use case you're solving for. And think about it: it makes your evals easier too, because otherwise you have all these atomic tools and you're running evals on all of those atomic things.
[00:07:21]
But if you can be like, ah, here's a use case people always hit, is there a way for us to make one tool that solves that use case perfectly every single time? Right? Like, what if we saw that people on our agent did this thing, which ran this one tool, which then led it to run this other tool, and then this third tool, every single time? And when it did those three tools in the same order, it was perfect.
[00:07:45]
OK, how about we make one tool that just does all three of those in code, automatically, every single time, deterministically? We'll figure out what the inputs are, get rid of the other ones, and replace them with just that one. Now everybody gets to benefit from that deterministic workflow as a tool, right? So try to be high level and then take it down as you need to, 'cause you will, you will need to take it down. But start as high as you can, not as low as you can.
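A sketch of what that consolidation might look like; the step functions here are hypothetical stand-ins for three existing atomic tools the agent always chained in the same order:

```ts
// Hypothetical stand-ins for three atomic tools the agent used to pick,
// in this exact order, on every run of this use case.
async function stepA(query: string) { return { hits: [query] }; }
async function stepB(result: { hits: string[] }) { return result.hits[0]; }
async function stepC(top: string) { return `summary of ${top}`; }

// One deterministic workflow tool replaces all three: the order is now
// guaranteed in code instead of left to the LLM on every loop.
export const solveUseCase = {
  name: "solve_use_case",
  description: "Runs the full A -> B -> C workflow in one call",
  async execute(input: { query: string }) {
    const a = await stepA(input.query); // was tool #1
    const b = await stepB(a);           // was tool #2
    return stepC(b);                    // was tool #3
  },
};
```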