Transcript from the "Dataset Solution, Part 1" Lesson
>> Shirley Wu: So this is awesome, cuz I heard a lot of conversation about the data and I'm really, really excited to hear what you've come up with. So I want to go through some of these together. And I hope you'll either read it off or tell me what the data type is and tell me if you think it's important or relevant for later on.
[00:00:18] And then what I want to do is based on the data attributes we list, I want to go through them and ask some questions. So kind of think through that as we're doing the data attributes also. So let's get started. I'm going to be doing it with the markdown file.
[00:00:41] But usually when I'm just doing myself, I'll put it on a piece of paper or I'll put it on my iPad. But whatever works for you in your process. So let's see. So you're going to have to help me. Because apparently, I can't see the attributes and write the text at the same time.
[00:01:01] So you're going [LAUGH] to have to help me read them out, but I do remember that the first thing is title. Although, I don't think I'm going to put them on there, cuz I don't think we can do much interesting things with them. Although I heard somebody say, sentiment analysis and that might be really interesting.
[00:01:19] Yeah, so that could be really interesting. So just in case, I'm gonna put that on there. And when I beta tested this class, there was debate as to what type of data attribute Title could be and we did not have a consensus. [LAUGH] Because we're like is it categorical, but it's unique?
[00:01:37] Is it a unique ID? So yeah, we don't have a consensus. But for now, I'm going to just not give it a data type. [LAUGH] But I think that'd be a really cool idea to do sentiment analysis. So please help me what the next-
>> Speaker 2: Year.
>> Shirley Wu: Thank you.
[00:01:55] Year, so that would be data type-wise?
>> Speaker 3: Temporal.
>> Shirley Wu: Thank you. And then let's list them all out, and then we can go back later on to highlight the ones that we think might be really important, okay.
>> Speaker 2: Released.
>> Shirley Wu: Released, and that is temporal also, cool.
>> Speaker 2: Run-time.
>> Shirley Wu: Thank you. [LAUGH] Thank you, Run-time and what you'd think that would be? Run-time is the length of the movie, right?
>> Speaker 2: It should be temporal or range value.
>> Shirley Wu: So-
>> Speaker 3: Quantitative?
>> Shirley Wu: Yeah, I think I will lean more towards quantitative. Cuz when I think temporal, I think dates.
[00:03:08] Okay, so let's get back into this. Run-time, we just talked about Run-time, which is quantitative. And then the next thing on the list is Genre. And I think there's probably consensus about what this is, right? Yeah.
>> Speaker 2: Categorical.
>> Shirley Wu: Categorical, thank you. Director, where did you put this one?
>> Speaker 3: Categorical.
>> Shirley Wu: Yeah.
>> Speaker 3: I guess.
>> Shirley Wu: I think I actually would really agree with that. Yeah, so I think I'd say, categorical. Because, like Genre, you can imagine potentially being able to group by directors. So you could be like, these are the movies that these directors have directed.
[00:04:00] [LAUGH] So that's why director could potentially be a categorical type of attribute. And then we have, actually, let's put the next few into the same bucket. So there's director, writer, actors. I would argue that all of them of categorical, because you can bucket movies with these. And then I'm going to ignore plot.
[00:04:31] I guess, again, you can do something like the title sentiment analysis, which is really interesting. But let me also make a brief note about potentially missing data or the validity of the data. So for example, I think, if I remember correctly, some of these plots might not be full, as in maybe some of them are cutoff if the plot is a little bit shorter.
[00:04:57] So when you're going through your data, just get a sense if some of the attributes, is it missing something or does it give all of the context that you need, etc? So I'm going to skip plot. There's also language, which I'm guessing, consensus was categorical or did you?
[00:05:28] Categorical. I'm going to ignore country, because this is all filtered by USA. Awards, what do you think about that one?
>> Speaker 2: Could be quantitative, if you're looking at just wins or nominations.
>> Shirley Wu: Yeah, yeah. So quantitative, I think that's a really good one. I think this is one of those that you could probably do a little bit of processing and break it down into a further, like maybe you can create categories for yourself that's like, this one has information about Oscars.
[00:06:08] So maybe for the awards, you just pulled out just the numbers for Oscars and that would be quantitative. Or they have things about wins and nominations. So maybe you pull out Oscars, wins, nominations. And those can be attributes that are quantitative. So I think that's really good.
>> Shirley Wu: And then I'm skipping the poster.
[00:06:31] And then we have some ratings. So we have metascore, which we talked about earlier as quantitative. And then we have IMDb ratings, which we also mentioned before. That's quantitative. And IMDb votes, similarly, that should be quantitative. And type of movie. I think they are all type of movies.
[00:07:01] And then the next one is interesting, DVDs. So DVD release, and that would be temporal. Thank you. Yeah, so that would be temporal. And I think something interesting could be looking at the time between the movie release and the DVD release. Is there anything interesting there? And so the reason why I like listing out data attributes as a very first step is not only do you start to get to know the data and what's there, you kinda start being like, there's potential for this, or this data attribute could help me do this.
[00:07:37] And it starts kinda triggering all these potential possibilities. And so the next one is BoxOffice. And this is actually a really great example. So I am looking at, I think, yes, so this is a really great example again of missing data or figuring out the validity of the data.
[00:07:59] So this particular movie doesn't have the box office figure. But later on, if you look, I think because this is 1999 and it's one of the earlier movies, I wasn't able to pull the box office figures. But there's definitely some others, I think, that do have them. So then this is a design question later on, I think, of, if we do end up using box office, how do we accommodate for the fact that a lot of the earlier movies, maybe ten or so of them don't have the box office figure?
[00:08:36] Do we go and manually put it in ourselves from other data sources? Or do we just ignore those years? So these are some design questions for later, depending on what we decide to do with the designs. So BoxOffice, this one, did we say quantitative? Yeah, cool.
>> Shirley Wu: And then Production.
[00:09:04] So Production, I think this one is movie studios, it seems like. So this is potentially interesting too. So this data site is the top blockbusters. So which big movie studios are kind of coming out with these blockbusters. And so how would you categorize? Yes, that would be [LAUGH] categorical.
[00:09:29] [LAUGH] And then I think those are probably the most irrelevant ones. So this is the list. And then do you want to shout out which ones you find interesting and from this list?
>> Shirley Wu: I see Madeleine going for it.
>> Speaker 3: We talked about seeing when the different ratings map with each other versus deviate, like comparing IMDb rating with Metascore or the other rating, Rotten Tomatoes rating.
>> Shirley Wu: Yeah. Cuz if you open that ratings thing up, there's also Rotten Tomatoes. So actually let me, cuz I think there's, let's say, Metascore. I'm just gonna turn this into Metacritic then, or Metascore and Metacritic, I think is the same. But let's add the Rotten Tomatoes score in there.
>> Shirley Wu: So you're saying, comparing these three.
>> Shirley Wu: And any others that anyone thinks is important to look at or interesting to look at, yeah?
>> Speaker 2: As we've defined the kind of loose type system in parenthesis, something that's interesting to me would be composing some of those categories. And seeing what the relationships, the individual sets are a different composition strategies, joins, associations.
[00:11:04] And basically using that as input dimension for something like more quantitative values, like ratings, votes. So you would end up taking a crack at exposing, like if you put a writer and an actor and a director together, what would be your revenue? And that would start with a compositional step on the categories that we've defined.
[00:11:28] And your second step would be to compare the result of that composition to one of the quantitative values. And I think the dataset, it leaves us a good room to explore in those directions.
>> Shirley Wu: Yeah, so basically, taking something like the director, writers, actors combining them and comparing them to something like the box office figures.
[00:11:51] So what you were saying, basically the question is, which directors, actors together, writers together give the most revenue? So the box office figure, tha'd be really interesting also. Yeah, so I think that also-
>> Speaker 2: Votes could be a proxy for how popular something is.
>> Shirley Wu: Mm, mm-hm. So I am IMDb votes, that's a good one.
[00:12:18] Actually, I think a lot of these could be potentially really interesting, like the-Run time versus DVD, I mentioned, sorry, Release versus DVD, I mentioned earlier. There's a lot of possibility here. So kind of based off what we just, yeah?
>> Speaker 2: We were thinking of something else where it's not there, but just have a competed popularity based on various things we pull out of the data, and it'd be are own function defined, whether it'd be relevant or not, who cares?
[00:12:58] Probably popularity function of some sort.
>> Shirley Wu: I think, yes. I think definitely, while we're in the data exploration phase, I think any idea is interesting, give it a try. And if it sticks, that's great. If it doesn't, no harm there. But the only thing I think I would say is if you decide to make something yourself, don't be afraid to go pull additional data from elsewhere.
[00:13:28] Let's say if you get this dataset and you have other things that you're interested in, like that popularity, don't be afraid to pull that from other different sources. I find that the Internet is a magical place. And a lot of times, you can find a lot of things if you just have the right search term.
[00:13:49] Except for I guess sensitive government data, I think is probably the one thing that's harder to find. But after doing that, I think the only thing I would say is if you do end up making something that's unique to yourself, to have a section in the final visualization at the very bottom that kind of talks about how you did that.
[00:14:11] Talks about your data sources and talks about the calculations, the way you did the calculations. That would be the only thing I would say.