Data Visualization First Steps Working with Data
Transcript from the "Working with Data" Lesson
>> Now all of us are busy developers, we have limited time in the day. We don't necessarily want to get a whole other second career in data visualization, maybe some of us do and that's awesome. But maybe some of us just want to kind of get up and running with a data visualization as quickly as possible, so how can we do that?
[00:00:21] So that's why we're gonna be using Observable today. Observable is an in-browser, reactive programming environment that is designed for data visualization, though it can be used for lots of other things. But if folks have heard of D3, the creator of D3, Mike Bostock, is also the creator of Observable, and Observable actually started out as a project called D3 Express.
[00:01:10] In this case, essentially, we're just going to be using a declarative library for data visualization called Plot. And then, above that code you'll see the latest iteration of whatever code you have entered below. And that also comes with these sort of instant interactive inputs, so essentially HTML inputs that are really quick and easy out of the box that you can then use to make interactive visualizations as well.
[00:01:56] And because it's sort of based on functional reactive programming, it has this reactive, instant-feedback environment. And that makes it really easy to iterate on your visualizations and tweak them to get them just how you'd like them. And so, we're gonna be walking through some step-by-step exercises later in Observable.
[00:02:19] We won't need to learn too much about Observable and how it works, I will explain just a couple of features that we're gonna wanna know. But there are also plenty of links in the materials if you wanna dig deeper with that. Okay, and so the first thing we're gonna need to do, or the zero thing, I guess I should say, that we're gonna need to do if we want to visualize some data is have some data [LAUGH] to visualize.
[00:03:09] One thing we can do is upload a local file, maybe I have a JSON file or a CSV file with some small data set that I care about. I can upload that to an Observable notebook and look at it right in the notebook. So we're gonna look at some examples for our projects here as sort of an easy way to get started.
[00:03:30] But also if you choose, you could sign up for an account and kind of upload your own files to be visualizing the data set that you in particular care about. The other thing we can do is connect to files in the cloud like for example Google Spreadsheets or sheets in Microsoft OneDrive or something like that.
[00:03:50] So, if you have a team that uses kinda cloud providers like that, you can connect to some files there, so that you can really quickly show your boss a very impressive visualization of your team's quarterly data or whatever it is. You can also connect to data directly from databases.
[00:04:08] So this is true in Observable, you can connect to a database, maybe you have a MySQL or Postgres database that has some data you wanna visualize. But similarly, if you were building this into your apps, you might have some connection to your database on the server side that then you're visualizing in the front end.
[00:04:27] And finally, you might be fetching data live from some API, whether internal or external. For example, we saw some data earlier from the GitHub API, and we're also gonna be looking at some of that later. So, again, we're not gonna be covering in detail how to grab data from these different sources, but there are links in the materials if you wanna read more about how to do that in Observable.
[00:05:10] And that is often called data wrangling, it could also be thought of as kind of transforming data, reshaping data. All of these words are kind of things you might see around the data viz ecosystem. But so in terms of data wrangling, what we're trying to do here is we have some raw data that came in whatever format from, let's say the API response or the way it's stored in the database.
[00:05:35] Or the way it was in the CSV file my boss sent me, whatever it is. But in order to answer the questions that I'm interested in, I might need it in a slightly different format, a slightly different shape, as it were. And so that's where we're gonna be doing a lot of processing on that data in most of our day to day use cases.
[00:05:57] Data wrangling is really the fundamental prerequisite to being able to make an effective visualization: getting the data into a shape that plays nicely with the type of visualization that answers the question you care about. So, in our case, what we're gonna be wrangling with today are just some basic built-in array methods.
[00:06:22] So things like filter and map, super handy for some basic data wrangling, reduce as well. So we're gonna take a look at one example of just doing some minor data wrangling on this, but again, data wrangling could be its whole own workshop here. And there are some other really great tools and libraries for kind of more advanced, more complex data wrangling use cases.
[00:06:47] And Plot itself has a little bit of that stuff baked in, so there are some transformations that we'll be able to do within our visualization library. The same is true of D3, so if you go on and get deeper into the world of D3, you'll learn some more data wrangling skills with that as well.
[00:07:07] So if folks haven't worked much with array methods like filter and map, I like to explain them with what I have adapted into this filter, map, reduce sandwich. So filter is going to be our handy dandy way of basically only looking at the stuff that we care about.
[00:07:24] So imagine, I've got an array of let's say ingredients, and I want to exclude from my final goal, which is a beautiful visualized sandwich, I want to exclude the veggies I guess that I don't like. And we'll just pretend for the sake of this example that I don't like onions and cucumbers, even though that is false for me personally, but let's go with it.
[00:07:48] So I can filter out the things that I don't care about. Now in real life when we're visualizing things, it's probably not gonna be vegetables unless we are working in retail at a grocery store, in which case it totally might be. But for example, let's say we're trying to make a visualization of different users' devices.
[00:08:08] And we, in this case, don't care about desktop devices, we're just interested in mobile, so we might be filtering out data like that. Then what we might be doing is needing to map from, let's say, these whole veggies that we have passed through our filter to sliced veggies.
[00:08:24] So I might have some kind of chop function or something like that, that I'm gonna pass into my map method and apply to each element in my data set. So now, instead of having these whole ingredients, I end up with some chopped ingredients. And then in some cases, we're gonna need to aggregate our data, meaning pull it together and create operations like doing a sum of values in a data set or averages and things like that.
[00:08:52] And often that'll be baked into our visualization tools. For example, Plot has some handy transformations that we're gonna look at later for doing things like creating aggregations, like sums and counts. But in other cases, we might need to do that in our pre-processing step, in our data wrangling step, and reduce is super handy for that.
[00:09:14] So in this case, I might be taking a bunch of sliced veggies, so elements in an array and reducing them with some kind of layer function into a single item, a single sandwich where they're all stacked together, smushed together. And what we might be doing in a data viz use case is, for example, taking a bunch of similar observations and summing up their values, for example.
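The sandwich analogy above can be sketched with the built-in array methods. This is only a minimal illustration: the ingredient list, the `chop` and `layer` helpers, and the `visits` data with its `device` and `count` fields are all hypothetical names made up for this example, not something from the workshop materials.

```javascript
// Hypothetical ingredient list; the names are made up for illustration.
const ingredients = [
  { name: "bread" },
  { name: "onion" },
  { name: "tomato" },
  { name: "cucumber" },
  { name: "lettuce" },
];

// The things we pretend not to like.
const disliked = new Set(["onion", "cucumber"]);

// filter: keep only the elements we care about.
const kept = ingredients.filter((d) => !disliked.has(d.name));

// map: apply a (hypothetical) chop function to each remaining element.
const chop = (d) => ({ ...d, name: `sliced ${d.name}` });
const chopped = kept.map(chop);

// reduce: layer all the elements into a single value, the sandwich.
const layer = (stack, d) => (stack ? `${stack} + ${d.name}` : d.name);
const sandwich = chopped.reduce(layer, "");

console.log(sandwich); // sliced bread + sliced tomato + sliced lettuce

// The same pattern on made-up device data: keep mobile rows, then sum them.
const visits = [
  { device: "mobile", count: 120 },
  { device: "desktop", count: 300 },
  { device: "mobile", count: 80 },
];

const mobileTotal = visits
  .filter((d) => d.device === "mobile")
  .reduce((sum, d) => sum + d.count, 0);

console.log(mobileTotal); // 200
```

Note that none of these methods modify the original array: filter and map each return a new array, and reduce returns whatever single value the callback builds up.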
[00:09:42] So these are some of the questions that you'll need to be asking yourself before you even try to start visualizing the data: does the data that I have, is it in a shape, is it in a format, that is going to make my life easier when I try to paint it onto the screen?
[00:10:00] If not, am I going to be able to do the transformations that I need to do at the visualization step or are those things that I'm going to need to do in the pre-processing and the wrangling phase? And so this is gonna be different in different use cases.
[00:10:16] But the point I wanna make here is just that this is an important question to be asking ourselves, do I need to do additional wrangling on this data to get it into a convenient format? Okay, and so in terms of what are convenient formats, for many visualization libraries, and Plot in particular, a format called tidy data is a really nice format to work with.
[00:10:42] So what is tidy data? Tidy data means that you have one row in your data set for each observation, and you have one column for each feature of that observation that you care about. And that means that every value that you care about essentially gets its own cell in the data set.
[00:11:08] So let's take a look at an example of some data that is not tidy. Here I have some population data or perhaps this is health data, we have some cases per population, so maybe I'm looking at some cases of a particular disease. And what we have is some confounding rows here.
[00:11:30] So I'll notice that I have two rows here for each combination of a country and a year. So I have two rows for Afghanistan in 1999, two rows for Afghanistan in 2000, etc. And that's because I have two different types of observations. One is for cases where I get a count here, and another is for population where I get a different count.
[00:11:52] But the thing is, this isn't tidy because again, we said we wanna have one row per observation. And we want every value that we care about to get its own cell, which means every feature that we care about gets its own column. And in this case, it seems like we care about the cases observation and the population as well.
[00:12:10] Each of those should really have its own column so that we don't have this weird type column where it's really just selecting between two different features. So instead, if we turn this into tidy data, what we get is a more concise table like this, where we have one row for each observation of a given country in a given year.
[00:12:32] And then we have two separate columns, cases and population. So instead of this sort of confusing type column and the count column, we'll have a value of cases and a value of population. So this is one example of how we might need to do some reshaping, some wrangling to get data from a non-tidy format into a tidy format.
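As a rough sketch of that reshaping, here is one way it could be done with reduce. The column names (`country`, `year`, `type`, `count`) follow the table described above, but the numbers are made up for illustration, and the `byKey` and `tidy` names are hypothetical.

```javascript
// Untidy: two rows per (country, year), with a `type` column that
// selects between two different features. The counts are made-up numbers.
const untidy = [
  { country: "Afghanistan", year: 1999, type: "cases", count: 10 },
  { country: "Afghanistan", year: 1999, type: "population", count: 1000 },
  { country: "Afghanistan", year: 2000, type: "cases", count: 20 },
  { country: "Afghanistan", year: 2000, type: "population", count: 2000 },
];

// Group rows by (country, year) and give each `type` its own column.
const byKey = untidy.reduce((rows, { country, year, type, count }) => {
  const key = `${country}|${year}`;
  const row = rows.get(key) ?? { country, year };
  row[type] = count; // "cases" or "population" becomes a column name
  return rows.set(key, row); // Map.prototype.set returns the Map itself
}, new Map());

// Tidy: one row per observation, one column per feature.
const tidy = [...byKey.values()];

console.log(tidy);
```

The result has one object per country and year, each with its own `cases` and `population` fields, so there is no `type` column left to select between features.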
[00:12:54] But then again, the most ideal format for your data is going to depend on what question you're asking and what visualization you're trying to build. So the crucial point is that we need to be looking at this critically and asking ourselves, before we even try to draw any colors or bars or lines or what have you, does my data have a convenient shape?
[00:13:19] Once we've got data wrangled, then we can actually do the data visualization. Notice that we're several steps in at this point. Visualization is sort of the part that comes after you've done all of this housekeeping work. And so the process of visualization, we can sort of think of as the process of transforming values in the space of the data set that I have.
[00:13:45] Which are essentially going to be numeric values on some kind of domain, we might call it. Some spread of numeric values, transforming those into values in the graphic on the screen. So that might be pixels of width, or it might be colors, or it might be other features of the visualization.
[00:14:08] And we're going to use a function that we call a scale to essentially map from the values in the data set to the values on the screen that we care about. And so this is what that looks like. Essentially, what we have is we have some values in the data space.
[00:14:26] Which again, if we think back to kind of math class and functions having a domain of input values and a range of output values, the data space is going to be our domain, so our input values. In this case, maybe we have values going from 0 to 10,000, that's not really the values here, just for example.
[00:14:46] And then we're going to transform those via a scale function into values in the visual space, the range or the output of this scaling function. So in this case, we might have a certain number of pixels to work with on our screen, or we might have a certain number of colors in a spectrum that we wanna use to visualize different values.
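To make the domain and range idea concrete, here is a hand-rolled linear scale. It is only a sketch of the concept, not how Plot or D3 implement their scales; the `linearScale` helper, the 500-pixel width, and the lightness range are all assumptions picked for this example.

```javascript
// A minimal linear scale: takes a domain (data space) and a range
// (visual space) and returns a function that maps one onto the other.
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Data values from 0 to 10,000 mapped onto 500 pixels of width (assumed).
const x = linearScale([0, 10000], [0, 500]);

console.log(x(0)); // 0
console.log(x(5000)); // 250
console.log(x(10000)); // 500

// The same idea for color: map the data onto a lightness percentage,
// so that larger values come out darker. The 90-to-20 range is arbitrary.
const lightness = linearScale([0, 10000], [90, 20]);
const color = (value) => `hsl(210, 60%, ${lightness(value)}%)`;

console.log(color(5000)); // hsl(210, 60%, 55%)
```

The scale is the same function in both cases. Only the range changes, depending on whether the output is a position in pixels or a color.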
[00:15:06] So this is really one of the core, core, core things. If there's only one thing to take away from today, it's essentially that visualization is all about understanding how your scaling functions map features from your data, in the space in which the data exists, to features in your visualization, in the space in which the visualization exists.
[00:15:29] So on the one hand, we have these abstract values in the data set and then on the other hand, we have these concrete values on the screen, pixels and things. Okay, so that is what we're gonna be looking at today: essentially, how we can map different features of the data to different channels
[00:15:50] (we're gonna learn that vocab term) in the visualization, using these scales as, essentially, functions to do that transformation.