AI Agents Fundamentals, v2

Evals Telemetry with Laminar

Scott Moss
Netflix

Lesson Description

The "Evals Telemetry with Laminar" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott explains synthetic data and creating use cases to test agent performance, covering data collection and evaluation. He also demonstrates using OpenTelemetry with Laminar for improved observability and metrics.


Transcript from the "Evals Telemetry with Laminar" Lesson

[00:00:00]
>> Scott Moss: Where do you get the data? Synthetic data. Most likely we'll start with that. This is data that we make. So what we're going to do is make three different kinds of cases. The first ones, I like to say, are our golden use cases. These are use cases where I feel very strongly that if somebody typed this in, our agent should be perfect. I know to a certain degree that it should do this and this and this, in this order, with this tool call. It should always do this, right?

[00:00:25]
So I'll think of those use cases in my head, the ones where I think the agent should behave exactly that way, and I'll just make that fake data. Then I'll think of secondary ones. These would be cases where the user didn't really do a good job of writing a good prompt, but our agent should still figure it out because it's smart enough. I'm less strict here; I don't really care in which order it figures it out.

[00:00:45]
I just want it to figure it out, and it should. I'm less confident in what that order or that path might be because the prompt was kind of not there, but we should still handle it. And then I'll make a third bucket, which is negative use cases: somebody asks our agent to, like, mop their floor. The agent should just say no, right? It should not go pick a tool.

[00:01:08]
Oh sure, let me pick the Gmail tool to mop your floor. Like, why are you doing that, right? That would be bad. So I want to write evals for negative use cases as well, for things that we should not handle, right? That's typically where I start, and then from there I start collecting data as people use it, and I'll sort it into those buckets: yep, this is a perfect use case that we want to handle.
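
To make those three buckets concrete, here's a minimal sketch of what the synthetic cases might look like in code. The shape and the field names (category, input, expectedTools) are made up for illustration; they aren't Laminar's format or any particular framework's.

```typescript
// Hypothetical shape for synthetic eval cases, grouped into the three buckets
// described above: golden, secondary, and negative. Field names are illustrative.
type EvalCase = {
  category: 'golden' | 'secondary' | 'negative';
  input: string;            // what the user typed
  expectedTools: string[];  // tool calls we expect, in order (empty for negative cases)
};

const cases: EvalCase[] = [
  {
    // Golden: the agent should nail this, with exactly these tool calls in this order.
    category: 'golden',
    input: 'Email the Q3 report to my manager and set a reminder to follow up Friday',
    expectedTools: ['sendEmail', 'createReminder'],
  },
  {
    // Secondary: vague prompt, but a smart agent should still figure it out.
    category: 'secondary',
    input: 'handle that report thing for my boss',
    expectedTools: ['sendEmail'], // the exact path/order matters less here
  },
  {
    // Negative: out of scope. The agent should refuse, not pick a tool.
    category: 'negative',
    input: 'Can you mop my floor?',
    expectedTools: [],
  },
];
```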

[00:01:30]
Or: we should probably handle this, but this is a shitty prompt; these people are bad at prompting. Why are they asking us this? This is not something we handle. And it can also be a signal, like, damn, a lot of people are asking us to do this, should we support it, right? You can get product ideas that way too, through evals. There's also hill climbing. Hill climbing is where you don't do any of that data work I just talked about; instead you just start off with an empty run as the baseline and then you compare everything after that.

[00:02:00]
So the first run is the data, it is the baseline, and you compare everything after that against it, right? That's literally what hill climbing is. You just get baseline scores from where you are right now and then you go from there. That's effective, but it can take a while; it's called hill climbing for a reason. But yeah, you do that, and if the scores improved, good, keep that change.
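
Here's that loop sketched out, assuming a hypothetical runEvals() helper that runs your eval suite and returns an average score; the names are made up for illustration and aren't from Laminar's SDK.

```typescript
// Hill climbing in its simplest form: the first run is the baseline, and a
// change is kept only if the average eval score goes up. runEvals() is a
// hypothetical helper; the stub below just returns a placeholder number.
async function runEvals(): Promise<number> {
  return 0.5; // placeholder for your real eval suite's average score (0 to 1)
}

async function hillClimb() {
  // First run: no curated dataset needed, this score IS the baseline.
  let baseline = await runEvals();
  console.log(`Baseline score: ${baseline}`);

  // ...change a prompt, a tool description, or the model, then re-run...
  const next = await runEvals();

  if (next > baseline) {
    console.log(`Improved to ${next}: keep the change`);
    baseline = next; // the new score is now the bar to beat
  } else {
    console.log(`Dropped to ${next}: revert the change and try something else`);
  }
}
```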

[00:02:28]
If they didn't, git stash, because that ain't it. You went downhill; whatever you did is not it, undo it, right? And then you keep repeating. So, we talked about that. The last thing is scorers. These are like tests, these are assertions. They're things that convert some qualitative thing into some quantitative thing, and we get a score, usually between zero and one. That way we can do averages and give you an overall score, right?

[00:02:57]
One is perfect; zero is like, what is this? That's completely off, that's a zero, right? So you can make your own scorers, and there are tons of existing ones. This is a whole science, okay. There are people who literally only do this. They're like a combination of data scientists and machine learning engineers who just come up with the metrics and the scores, and it is very complicated, okay.

[00:03:23]
I don't know enough about all the different creative ways to do this. I've just learned by using the stuff that I've seen in the community, looking at really cool examples, and doing trial and error. But yes, evaluators do exactly what they say: they evaluate some input/output pair. So usually you get an input, you get an output, and you get an expected output, and you evaluate those things against each other and give it a score, right?
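
As a framework-agnostic sketch, an evaluator is just a function over that output/expected pair that returns a number between zero and one. Both scorers below (exactMatch, toolOverlap) are made up for illustration, not taken from any particular library.

```typescript
// Strict scorer: 1 only if the output matches the expected output exactly, else 0.
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Softer scorer: what fraction of the expected tool calls did the agent make?
// An empty expected list is a negative case, where calling any tool is a failure.
function toolOverlap(calledTools: string[], expectedTools: string[]): number {
  if (expectedTools.length === 0) {
    return calledTools.length === 0 ? 1 : 0;
  }
  const hits = expectedTools.filter((t) => calledTools.includes(t)).length;
  return hits / expectedTools.length;
}

// Averaging scorers gives a single number you can track run over run.
const score =
  (exactMatch('Email sent.', 'Email sent.') +
    toolOverlap(['sendEmail'], ['sendEmail', 'createReminder'])) / 2;
console.log(score); // 0.75
```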

[00:03:49]
We talked about tracing; we're going to set it up in a little bit. Tracing essentially allows us to have analytics on our LLM calls. We're going to be using something called OTel, which stands for OpenTelemetry, if you know what that is. It's an open standard for observing applications and it doesn't belong to anybody. It's just an open standard for observability, and there are tons of apps that implement GUIs on top of it, right?

[00:04:18]
So, there's this one that we're going to use called Laminar. Sign up for an account here, it's free. You don't need a credit card; I didn't. I've been using it for the last two weeks and I'm probably going to keep using it going forward. It's a new one, it's not the standard one. Everyone probably uses Braintrust as the standard, but that shit is confusing as hell. I could teach a whole week-long course on how to use Braintrust.

[00:04:40]
I don't think we're ready for that. Laminar does everything Braintrust does but, in my opinion, with a better interface. Although I do love Braintrust, I know y'all are listening. I'm going to use Laminar because it's just free. So sign up for an account and use this. You don't have to if you don't want to; going forward, you'll still be able to follow along if you don't feel like doing this.

[00:05:02]
It literally takes two seconds, so you should do it. I think it's cool, but if you're like, ah, I don't feel like signing up, I don't want to do this stuff, that's fine. Nothing's going to break if you don't use this. It's just observability, just enhancing what we have, and it's really easy to set up. So go there, make an account, get your API key, and add it to your .env file. It should take less than a minute if you want to do it.
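
If you do set it up, the pattern looks roughly like this. The package name, env var name, and initialize call reflect my reading of Laminar's docs; treat them as assumptions and check the current documentation for the exact API.

```typescript
// .env (assumed variable name; copy the exact name from the Laminar dashboard)
// LMNR_PROJECT_API_KEY=your-project-api-key

// index.ts -- roughly how Laminar's TypeScript SDK is initialized, per their
// docs at the time of writing. Call this once, before creating your LLM client,
// so the OpenTelemetry auto-instrumentation can hook into it.
import { Laminar } from '@lmnr-ai/lmnr';

Laminar.initialize({
  projectApiKey: process.env.LMNR_PROJECT_API_KEY,
});
```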

[00:05:25]
I'm going to show it so you can see what I'm talking about. So if you're like, I'm just going to watch him and see what he's doing so I can see what it looks like, that's fine. Again, this won't break anything. It's like adding analytics: having analytics or not having analytics is not going to break your app. But having this does help with evals, so you can drill down to the root of things.

[00:05:48]
But Laminar is also the thing we're going to use for evals. You don't need to sign up for an account to use it, but if you want the dashboard to see everything, you're going to need to sign up. Otherwise, it'll just print the eval results in the terminal and that's it; you won't actually be able to explore them. So for a better experience, I do highly recommend signing up, but you don't have to if you don't want to.
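
For reference, an eval run with Laminar's SDK is shaped roughly like the sketch below, based on my reading of their docs: data items carry a data payload and a target, an executor runs your agent for each item, and named evaluators score the output. Verify the exact field names against the current documentation; runAgent is a placeholder for whatever actually invokes your agent.

```typescript
import { evaluate } from '@lmnr-ai/lmnr';

// Placeholder: your real agent loop / tool-calling code goes here.
async function runAgent(input: string): Promise<string> {
  return `ran agent for: ${input}`;
}

evaluate({
  data: [
    {
      data: { input: 'Email the Q3 report to my manager' },
      target: { expectedTool: 'sendEmail' },
    },
  ],
  // Runs once per data item; its return value is the output that gets scored.
  executor: async (data) => runAgent(data.input),
  // Each evaluator scores output vs. target, ideally returning 0 to 1.
  evaluators: {
    mentionsExpectedTool: async (output, target) =>
      output.includes(target.expectedTool) ? 1 : 0,
  },
});
```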

[00:06:10]
I'm pretty sure we can run their evals without signing up. I haven't tried it, but according to their docs, you should be able to. So if you don't sign up and you run into an issue, just sign up. And no, they're not sponsoring me. They are a YC company, and I know because I've been through YC a few times. People think I'm doing YC sponsorships and stuff; I'm not.
