IMDB and the Excel Data Model


This entry is part 2 of 9 in the series IMDB Data Files

What if we wanted to import this database into Microsoft Excel and work with it? We want to analyze the data. For example, we want to know how many movies are produced in each of the last 40 years and in what genre do they fall. Perhaps you have many more questions to pose against the data. Maybe you want to see a list of the highest rated movies and their rating in the last 20 years. What is the average rating of all movies in the last 50 years for each genre?

This post is just an overview of the algorithm I used to bring in the IMDB data into Excel. It does not list all the details of every step. For that, I have other posts and series of post that go into more detail about Power Query and Power Pivot.

The IMDB database a very large database. It has millions of rows. We need to use the Excel data model to bring it in. We also need to use Power Query because the data itself needs cleaning. For example, some of the years have “\N” stored in them. That really means null, so in Power Query we need to tansform the data. We also will need to split columns.

The first step would be to download the data from the Web and put it in a folder. Extract it (decompress it). Now create a new Excel file. I called mine IMDB2.xlsx. We will use this file to connect to the three sources of data we want: title.basics, title.ratings and name.basics.

Series Navigation<< Internet Movie Database (IMDB)IMDB Repeating Genre Values >>

Leave a Reply