AI Agent: From Prototype to Production

Viewing Eval Results

Scott Moss

Scott Moss

Superfilter AI
AI Agent: From Prototype to Production

Check out a free preview of the full AI Agent: From Prototype to Production course

The "Viewing Eval Results" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:

Scott demonstrates the React dashboard application included in the project. The dashboard displays the results from each eval run. The data for the dashboard is pulled from the results.json file in the project's root.

Preview
Close

Transcript from the "Viewing Eval Results" Lesson

[00:00:00]
>> Scott Moss: So now that we have that, we actually have a dashboard that we can go look at to visualize these changes over time. So we can see a chart on whether or not we're getting worse or we're getting better. To run that, open up another tab in your terminal.

[00:00:18]
cd into the experiment-viewer.
>> Scott Moss: I'm sorry, not experiment-viewer, dashboard. Why does it say experiment-viewer, that's weird. I changed it, I never went back, okay. cd into the dashboard. And then from there, there is a command that you can run, which is going to be dev. So you should be run npm run dev, that's gonna start an app.

[00:00:52]
You can go to that app and you'll see something like this, only ran it once, so I have a score of 1, so my dot is here.
>> Scott Moss: I'm going to run it again, intentionally breaking it so we can see the change. So I'm gonna say, yeah, I'm gonna say hi, right?

[00:01:21]
Or actually, what I'll do is I'll keep this one, right? I'll just add some more data, I'll add another piece of data, and this one will just say, all right, but I still expect it to call this function for some reason. And actually let me make sure that my tool call thing supports that so, actually we're about to find out.

[00:01:49]
I don't know if that supports that or not, but we're gonna see. So, now I'm gonna go back I'm gonna run it again.
>> Scott Moss: Yeah it did, so, as you can see it regressed so, previously we had a 1, now we have a 0.5 because one was right and one was wrong, so that's 50%.

[00:02:09]
So it went down 50%, so that was a pretty dramatic loss there. And then this just shows you that right there, so you can see the drop right there visually on the graph. And this is what you would get on that, like brain trust data. Obviously their product's gonna look way better than this, but they give you some sort of visualization on whether or not your thing increased or decreased and it'll log it out for you like this as well.

[00:02:33]
So you can look at it there, so you can look at it here in the log when you run it. You can look at it visually here and eventually be able to change these experiments. But you can also just go to your results.JSON. And you can look look at them here, so I can see that on this first run that had, I got back a 1.

[00:02:51]
It totally worked, it was great. And then I did it again. I got back a 1, but I also had another piece of data on this other one, and I got back a 0 on it, so its score was point 0.5. So, I mean, this is just JSON, you can build whatever dashboard you want based off this.

[00:03:07]
But yeah, so couple things, if you are running on Windows, trying to run these evals, the scripts that I created for you are probably incompatible this. Where is it, this run file here that does some foul stuff is incompatible with how the latest version of node does dynamic imports.

[00:03:29]
But no worries, you can just do this other thing, which you can just run the file directly so you can do MDX, TSX. TSX is like TS node, it allows you to run TypeScript from node, if you have button, you can just use button too, but NPS, TSX, and then just the path to that file.

[00:03:48]
So in this case, it'll be evals, experiments, and then Reddit. So you can just run it like this and in my case, it's gonna break because it's saying the open AI Key is not loaded and that's because that environment variable gets loaded when you run my script. So if you're gonna run the file directly, you would have to import .env at the top of every experiment, so to import your env file.

[00:04:18]
So that way you don't run into that error.
>> Scott Moss: And then you can do the same thing, you get the same result.

Learn Straight from the Experts Who Shape the Modern Web

  • In-depth Courses
  • Industry Leading Experts
  • Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now