Check out a free preview of the full Complete Intro to SQLite course

The "Aggregation" Lesson is part of the full, Complete Intro to SQLite course featured in this preview video. Here's what you'd learn in this lesson:

Brian demonstrates aggregation commands like COUNT, GROUP BY, and HAVING. Since a WHERE clause applies to the initial data set, not the aggregated data set, commands like COUNT and GROUP BY must be used in a HAVING statement to filter the results.

Preview
Close

Transcript from the "Aggregation" Lesson

[00:00:00]
>> Brian Holt: That's foreign keys, let's get into aggregation. So, you've probably seen this query before. It's one of the most common ones you'll run. What if you just wanna know how many tracks we have in our database?
>> Brian Holt: You have this thing called an aggregation function, which is count.

[00:00:22]
Count star, as you might guess, is just count how many things are that, you know, are in this database and give me back the number. So, in our database we have 3,503 tracks. As you might imagine, you're able to count how many DISTINCT things you have in the database like DISTINCT values.

[00:00:41]
You can count how many things are greater than a hundred. You can give it a bunch of different aggregation things. Let's do DISTINCT. So, we're gonna do SELECT COUNT. There's no space there, it doesn't matter. But DISTINCT GenreId FROM Track. So, basically, we went into the Track database, and we counted how many different, DISTINCT means different, in this particular case, different values of GenreId that I have, right?

[00:01:19]
Now I could have also done SELECT COUNT star from genre, right? That would also give you the same thing. But we could also do ArtistId. That might actually be interesting. I don't know what that's gonna be. There is no known Track. You have to do album ID and join anyway.

[00:01:37]
Never mind, we couldn't do AlbumId. So, 347, how many things do we have in Album? Is it the same number? It is, that's kind of interesting, right? But now what if we do artist? We have 275 artists, and we do SELECT DISTINCT Album ArtistId, nope, that's not what I wanted to do.

[00:02:17]
Yeah, no, it is. ArtistId from Album. Is that the same number? It's not, which is kind of interesting, right? So, we have 275 artists. However, we only ever see 204 referenced from the album table, which implies the fact that we have 71 artists without albums, kind of interesting, right?

[00:02:47]
So, that's what DISTINCT does. You can also just do SELECT DISTINCT GenreId from Track, and this will actually give you the whole set, right? So, we have, as you might imagine, 1 to 25. And again, if you don't do DISTINCT, it's gonna give you multiple copies of everything, right?

[00:03:11]
So, if I do GenreId, how many copies of everything that I'm getting, right? There's a lot of sevens. I'm gonna guess seven is probably rock, right? So, DISTINCT basically says, collapse this down so, I only have one value per row. Yeah, Mark.
>> Mark: What happens with count if you have a column that contains nulls?

[00:03:40]
>> Brian Holt: That's a good question.
>> Brian Holt: Well, if you're just doing count star, it's gonna count them, right? Cuz there's still rows, right? As long as it's in the results set, it doesn't matter. If you're doing a DISTINCT, my thought, and I think I'm right here, is that it'll count null as just a distinct value, cuz it is a distinct value, right?

[00:04:05]
So, if we had 1 through 25 and null, right? We would get 26 because null would be an additional distinct value in there, that's my thought.
>> Brian Holt: Okay, cool.
>> Brian Holt: So, what if we wanna see which genre has the most,? Seems like a pretty reasonable question to ask.

[00:04:38]
We can do SELECt GenreId and then COUNT(GenreId). When you do these aggregation functions, you'll frequently say like AS count or something like that. This is just aliasing it to be called something else. And it doesn't really matter for what we're doing here because we're not actually gonna be referencing that column later, but in your code, you don't wanna reference the column name as this.

[00:05:10]
You wanna reference it as count, right?
>> Brian Holt: Up to you to do whatever you want, this is totally valid. But in code, I would alias it to be something else. From Track, and then you're gonna do a GROUP BY GenreId. So, you can see 7 has 579. And you can see 1, whatever 1 is, maybe that's rock is has 1300.

[00:05:44]
And 25, there is only one track that has genre 25. Still not super useful, cuz what is 25? Now, I'm deadly curious. I don't remember. What is it? Well, what are we gonna do if we have something that is referenced in another table? Join [INAUDIBLE] We're gonna join them, right?

[00:06:04]
So,
>> Brian Holt: Let's do it. So, I'll just modify this one a bit. We're gonna have to make it reference. So, it's gonna be Track.GenreId. There we go, we're gonna put a Genre Name and then we're gonna do Track. This actually doesn't matter. You can say Track.GenreId or you can do Genre.GenreId.

[00:06:45]
I think you can actually even omit it since it's not ambiguous, but don't. Please just give it a table. Then from Track JOIN Genre ON Genre.GenreId = Track.GenreId. GROUP BY Track, or Genre.GenreId, it doesn't matter cuz they're the same thing, okay?
>> Brian Holt: There's one Track of opera, and rock was 1, that was wrong.

[00:07:27]
Latin was seven.
>> Brian Holt: Kinda cool, right? You're now data scientists. I'm just kidding. Well, not really, I mean, this is a lot of what being a data scientist is. They probably do a little bit more complicated [LAUGH] Queries than this, but sometimes not. I will say that I get by with a lot of times not having a data scientist because I can do basic queries like this, right?

[00:07:58]
Not being afraid of SQL is actually kind of a superpower for a developer.
>> Brian Holt: Cool,
>> Brian Holt: Yeah, and we're using COUNT because I'm gonna say that count is usually the most common one used. You're gonna use that one a lot. But there's definitely others. What if, instead of saying we wanna see how many tracks, we wanna see the highest Track ID for every genre?

[00:08:38]
I don't know why that's useful, but it could be. Instead of saying COUNT, you would just say MAX, right? And you need to do Track ID, not GenreId, because that would be silly. It doesn't make any sense. Track.Track ID
>> Brian Holt: So, you can see the highest rock is 3,355.

[00:09:04]
That's the highest ID that it has on the table. Which one is the lowest? Yeah, heavy metal, they stopped after 1304, Bossa Nova, 660. Rock and Roll, 122. And again, you probably guessed but you can also do a MIN.
>> Brian Holt: Right? And so, the first thing in the table is an ACDC song.

[00:09:32]
We saw that already, so it is one. But there's only one opera song and it's 3451.
>> Brian Holt: So, again, these are called aggregation functions.
>> Brian Holt: Okay, I'm gonna scroll down here a little bit, and we're gonna talk about HAVING
>> Brian Holt: So, what is wrong with this query right here?

[00:10:00]
So, we're selecting Track GenreId, Genre.Name, and the count of how many Track IDs or genres for each track there is from Genre, from Track, JOIN, Genre. And then we're giving it a WHERE COUNt where the count of the GenreId is greater than 500 cuz I only wanna see things that have 500 or more in the database, right?

[00:10:22]
I don't wanna see opera that only has one. Why does this WHERE clause not work? It is kinda confusing. Where works on the result set, right? So, it's gonna run, basically this SELECT, and it's gonna go grab all of the rows that match the things that we're not aggregating, right?

[00:10:49]
So, all the genre IDs and the names. And you're gonna have in this particular case, you're gonna grab every item in the database. So, it's 3503, I think is what we saw it was, right? Then the WHERE clause is then going to go rip out all of the things that don't match that.

[00:11:08]
What's the problem here is that we haven't aggregated yet, right? So, we haven't collapsed everything down yet. So, we can't filter based on TrackId because we don't have the count yet. We haven't filtered out all the things yet. So, this won't work because that only applies to the original results set and not the aggregated set.

[00:11:29]
This is the only time where I think the grammar is important of what order things operated. So, there's basically a second HAVING, if you will, or a second WHERE, called halving. So, instead of doing WHERE, you do HAVING at the end. And that will filter based on the aggregated set not the initial result set.

[00:11:53]
Does that make sense? And it's actually important because you can have a WHERE and you can have a HAVING at the same time, right? You can filter down to a smaller initial result set, aggregate based on that, and then filter based on that aggregated set. So, that's why you have to have both, because they are actually different things.

[00:12:15]
You're filtering in different places, and it makes a difference on where you're filtering, yeah.
>> Student 1: You could do this as a subquery?
>> Brian Holt: You could definitely do this as a subquery. We're gonna talk about subqueries perhaps even next.
>> Student 1: Okay.
>> Brian Holt: Yeah. Like programming, there's many ways to go about doing this, and I just invite you, 999 times out of 1,000, go with the way that makes the most sense and is the most readable because most databases are extremely fast.

[00:12:51]
So, yeah, I'm showing you how to do this with HAVING and JOINS. I won't show you necessarily how to do this with subqueries, but you could definitely do this a couple of other ways.
>> Student 2: So, is the important part that it's the same count in the select and HAVING or is that just coincidence, like the Track.GenreId?

[00:13:16]
Is it important that it matches in the SELECT and the HAVING?
>> Brian Holt: Yeah, I mean, what actually would be a better way of writing this? I mean, let's just run it, just so you can see what it is. That's the result there. Again, if I was just writing this for myself
>> Brian Holt: Way clearer, right?

[00:13:42]
Yeah, that's why, AS is super helpful here.
>> Brian Holt: But yeah, you cannot reference anything aggregated in the where clause. That's the rule for you to hold on to in your brain. If you're doing any sort of COUNT, MAX, MIN, any of that kind of aggregation stuff, DISTINCT is another one sometimes, depends.

[00:14:07]
I guess even still, if you're doing DISTINCT, you're gonna reference what happens in the GROUP BY. I mean the GROUP BYs actually ends up being really important here, right? Because that's actually what you're gonna be collapsing back down on to.

Learn Straight from the Experts Who Shape the Modern Web

  • In-depth Courses
  • Industry Leading Experts
  • Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now