Transcript from the "Exploration Exercise" Lesson
>> Shirley Wu: If you go to this part with the data exploration, I also have the full version of my notebook which honestly I think isn't as good as explorations we did today. So the way that I went was a lot about seasonality. The questions I had were like when are most blockbusters released, summertime, is it holidays.
[00:00:23] And I wondered about the spread of metascores, what genres are common, and actors etc. And then what I ended up finding is I ended up doing some bar charts of which months have the most number of movies, and like we saw with the heat map, it tended to be summer or winter holidays.
[00:00:49] And I looked at the different genres. So I did a really ugly stacked bar chart of all of the genres and which months they were released which turned out to be really ugly. So I tried to do an area chart, that was a little bit easier to read.
[00:01:06] And so at the end, the thesis, I wanted to design around is these two points that I found that blockbuster movies are highly seasonal. And out of those, there are some genres like action, adventure, animation, etc, that are the most common. I wanted to just kind of go through a little bit of things that I've learned over the years that have helped me a little bit.
[00:01:32] Just because I don't come from a data analysis background, data engineering, or data science background, these are some of the things that I've kind of learned to do over the years to help try and make up for that lag. So the first one I mentioned before is check for missing data.
[00:01:51] And check for about the validity of the data and then account for that, either in going out and getting the data or noting in your kind of footnotes that there are some data that are missing. And while you're going through and exploring the data, try and focus on just one of the questions at a time because it's very easy to get sidetracked with a tangent, it's very easy to be like, this is super interesting, let me go down this rabbit hole.
[00:02:22] Especially if you have a big, big dataset that's really interesting, you might just get sidetracked by a bunch of different tangents and not really get anywhere. So focus on just that one question you started out with. And try and figure out the answers for that, dig into it until you have a good answer.
[00:02:43] But if there is an interesting tangent, make a note about it for later because if your original question leads to a deadend, then you can explore another question on your list or that tangent you found earlier that's really interesting. And don't be afraid to go out and look for additional data to help with your exploration.
[00:03:07] And finally, sometimes when we find no interesting patterns in our data, that really is a very interesting point. And one of the times when I really learned that is I had a client project with the Guardian about a dataset that was about the homeless bussing in US cities.
[00:03:30] And we were trying to look at this bussing program is where cities will buy a one way ticket for homeless people to another city. We were trying to find if there was any seasonal trends, if people tended to get these bus tickets more in summertime or something, or if there was trends in terms of geography and we just couldn't find anything.
[00:03:55] We couldn't find any interesting patterns. And that's when we realized the reason why there's no interesting patterns is because one of the requirements for the bussing program is that the homeless person has to have someone on the other side in that destination city that will kind of receive them and take care of them.
[00:04:17] And so that's why there's no seasonal patterns or geographic patterns because it's not people are traveling at their own convenience, it's that they are traveling to someone, a family member, or a friend or somebody that can hopefully take care of them.
>> Shirley Wu: And that's why there's no pattern because it's more a random human connection sort of thing instead of any seasonality or geographic thing.
[00:04:44] And so that's when I learned new interesting pattern is just as interesting.