Data Visualization First Steps

Aggregation & Grouping Solution

Anjana Vakil

Anjana Vakil

Software Engineer & Educator
Data Visualization First Steps

Check out a free preview of the full Data Visualization First Steps course

The "Aggregation & Grouping Solution" Lesson is part of the full, Data Visualization First Steps course featured in this preview video. Here's what you'd learn in this lesson:

Anjana walks through the solution to the aggregation and grouping exercise.


Transcript from the "Aggregation & Grouping Solution" Lesson

>> This was a little trickier than our last examples, you're doing it right and if this was a little bit of a struggle, it just goes to show why sometimes doing the number crunching in the wrangling phase as opposed to during the visualization interface can be a helper.

Yeah, questions from the chat.
>> One of the objects has a Resolution value of not set. How would you recommend that we handle that just use null values for the width and height?
>> So good question. I guess it depends on, in terms of a real world application.

It depends on what you want to do when you're missing data values from your data set. So, in some cases, like for example, when we're converting to numeric values, we might just be dropping those missing numbers, or you might be choosing some kind of. Default value that you want to stand in, for missing values like zero for example, so we'll see in our next exercise when we're plotting the resolution values in kind of more detail.

We'll see that we might end up with some zero values that are artifacts from those kind of missing data values. So one thing you could do if you wanna exclude those from your visualization, is you could use a filter operation like we said before, to make sure that any rows of the data that are missing values that you care about don't get included in the final data set.

Or if you want to be visualizing those separately as maybe a different category, you could be doing that in your data wrangling phase as well. Cool, good question. Okay, so let's see if we can get this user total bar chart happening here. So what I've done is I've just pasted in our previous chart with the count.

And so we said that in this one, if we open up these hints. The calculation that we wanna perform here instead of count, is the sum of a particular value. But if I replace sum here, instead of count, I'll notice that nothing really happens to my chart. And that's because plot is still essentially summing up observations, and so in this case it's kind of the same thing as sum versus account when we're talking about observation, so instead and we're telling plot.

Hey sum up whatever variable, whatever property is on the y axis, what we need to do is specify which variable we care about for the y axis by passing it into our inputs configuration as well. So in this case we care about that users property, so what I'm going to do is add in to our inputs here, in addition to device being along the x axis.

At the end of the day, what we want on the y axis is the number of users. It's just that instead of the direct number of users for each different observation, as if they were all little tiny bars along the x axis. We wanna smush them together and just get one bar with its y value set at the sum of the users values for that device category.

So by declaring the same channel in our inputs as our outputs, we're telling plot, hey taken users for that y axis value. But instead of outputting the original values, group them together by device category and sum up those y values. Some of those users count for the y values.

So let's see what happens when we do that. So now we see that the units have changed. And I think maybe we just need to add a little bit more margin here. To see the numbers there we go, okay, they're not all zeros. And again, we cannot configure the size and shape of our plot here.

We could do other things like configure the, we could change this to be in the millions of users by changing the units in our data. And so now we have got actually we see that the difference is even more stark, we have a lot more users for desktop devices than we do for tablets for example.

Okay, now what else did we wanna do here, we wanted to try to color the bars by device category, did anybody figure out the way to fulfill in here? So in this case we're not actually doing any aggregation on the colors or the device categories, so we're just going to include another input channel here that's not going to get transformed at all.

That's going to fill the bars by the device category. And again we already saw in our last exercise or our last project how we can override the color scales. Range values, if we wanna have different colors for these. But in this case, we don't really need to map them to any kind of meaning, we just need to distinguish the three different kinds, so the default colors will work fine.

Now, what about the title? We said we would be cool to add a tooltip with the actual value of the total number of users. So, this is going to follow a similar pattern to the way that we declared the users. Some should come through on the y channel, we're going to do something similar with the title channel.

So in this case we're gonna add in the title as the users value, to our inputs here and we're gonna perform that same sum operation on the title channel that we did on the y channel. So that now, hopefully when we hover, we see a tooltip with the total number of users for each category.

Eventually we have to wait a little while for it to show up. Okay, any questions so far on that excercise? Yeah, I think we have one question, yeah.
>> Yes, so for this y has a title of sum. So when we hover over the bar and see this number, are we seeing the y's sum of title or are we seeing the x's title of users?

>> So what we're seeing with each of the channels is that, we've declared in these outputs of this group transform, is the sum of whatever value we've passed into that channel. So here we're seeing two different representations of the same value. Where the value is the sum of the user's channels.

So first of all, we're seeing it on the y channel. So we're seeing it with the height of each of these bars. That corresponds based on our output domain here sorry, our output range, corresponds to the relative input values from our domain. But then also when we hover over them, we are seeing that same value encoded in the other channel, the title channel, which we have also passed in the same users.

Column, the same users value due to that transform. So essentially we're telling plot to do this same calculation, and express it as two different channels. So express it on the one hand as the height of each of these bars, the y channel, and on the other hand as the title on each of these, Bar elements that shows up as that little tooltip when we hover over.

So essentially we're similarly to in our last project, we kind of reduplicated the success or failure state through both the film channel and the stroke channel. It's a similar approach here. It's just we have the added complication of there's an aggregation transform happening. So in both cases, we're grouping that number of users based on the three different values for device, that we have in our dataset.

Okay, so we've done our first aggregation operation pretty cool. And now we have a more useful chart. And we'll leave it as an exercise to the reader to play around with if you wanted to, let's say aggregate by the number of sessions that we saw in the dataset instead of the number of users.

I bet at this point, you can figure that out real quick.

Learn Straight from the Experts Who Shape the Modern Web

  • In-depth Courses
  • Industry Leading Experts
  • Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now