The Internet Movie Database (IMDb.com) is a large website containing information about films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
Download Part of the Database
If you work with data, you may be interested to know that you can download parts of that database. The IMDb Datasets webpage links to the download location, shown near the top of that page as https://datasets.imdbws.com/. Here is a list of the files you can download to your own computer.
- name.basics.tsv.gz
- title.akas.tsv.gz
- title.basics.tsv.gz
- title.crew.tsv.gz
- title.episode.tsv.gz
- title.principals.tsv.gz
- title.ratings.tsv.gz
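As a small illustration of the list above, the following Python sketch builds a download URL for each file. It assumes that every file sits directly under the https://datasets.imdbws.com/ address; the names DATASET_FILES and dataset_url are made up for this example.

```python
from urllib.parse import urljoin

# Base address taken from the IMDb Datasets page.
BASE_URL = "https://datasets.imdbws.com/"

# The seven files listed above.
DATASET_FILES = [
    "name.basics.tsv.gz",
    "title.akas.tsv.gz",
    "title.basics.tsv.gz",
    "title.crew.tsv.gz",
    "title.episode.tsv.gz",
    "title.principals.tsv.gz",
    "title.ratings.tsv.gz",
]


def dataset_url(filename):
    """Return the download URL for one of the listed files.

    Assumption: each file is served directly under BASE_URL.
    """
    if filename not in DATASET_FILES:
        raise ValueError(f"Unknown dataset file: {filename}")
    return urljoin(BASE_URL, filename)


if __name__ == "__main__":
    for name in DATASET_FILES:
        print(dataset_url(name))
```

To fetch a file you could pass one of these URLs to urllib.request.urlretrieve together with a local file name. Keep in mind that some of the files are large.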
title.basics.tsv.gz
This one gets you started. It is a gz file that needs to be unzipped into a text file whose values are separated by tabs. Once it's unzipped with a program like 7-Zip, you get a file called data.tsv. TSV stands for tab-separated values. I suggest renaming the data.tsv file to title.basics.tsv. This file is very large, with over 8 million rows! You cannot open it in full with Notepad, Notepad++ or MS Excel. However, the Large Text Viewer app can open it, and in Excel you can load it into the Power Pivot Data Model. Here is a list of the columns in title.basics.
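If you prefer a script to a viewer app, Python can read the compressed file directly, without unzipping it first. The sketch below returns the header and the first few rows. The function name preview_tsv_gz is made up for this example, and it assumes the file is UTF-8 text with a header row.

```python
import csv
import gzip
from itertools import islice


def preview_tsv_gz(path, n=5):
    """Return (header, first n data rows) from a gzipped TSV file.

    Assumptions: the file is UTF-8 encoded and its first line is a header.
    """
    # "rt" opens the compressed file as text, so nothing is unzipped to disk.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters in titles as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = list(islice(reader, n))
    return header, rows
```

For example, preview_tsv_gz("title.basics.tsv.gz", n=3) would return the column names and three rows, assuming the download is in the current folder. Only the requested rows are read, so the size of the file is not a problem.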
- tconst – this is the primary key
- titleType
- primaryTitle
- originalTitle
- isAdult
- startYear
- endYear
- runtimeMinutes – some of the values are missing
- genres
tconst is a string that uniquely identifies a title, such as a movie. If you were to find the movie Snowden in the file, you would find that the tconst is tt3774114, titleType is movie, primaryTitle is Snowden, originalTitle is Snowden, isAdult is 0, startYear is 2016, endYear is \N, runtimeMinutes is 134, and genres is Biography,Crime,Drama. The “\N” in endYear marks a missing value. In Excel’s Power Query we need to replace the “\N” with null.
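Outside of Excel, the same clean-up can be sketched in Python. The example below splits one line of title.basics into its nine columns and turns "\N" into None. The helper parse_title_row and the choice of which columns to convert to integers are assumptions made for this illustration.

```python
# Column names of title.basics, in file order.
COLUMNS = [
    "tconst", "titleType", "primaryTitle", "originalTitle", "isAdult",
    "startYear", "endYear", "runtimeMinutes", "genres",
]

# Assumption: these columns hold whole numbers whenever they are present.
NUMERIC = {"startYear", "endYear", "runtimeMinutes"}

# The two characters backslash and N mark a missing value.
MISSING = r"\N"


def parse_title_row(line):
    """Turn one tab-separated line of title.basics into a dict."""
    values = line.rstrip("\r\n").split("\t")
    record = {}
    for name, value in zip(COLUMNS, values):
        if value == MISSING:
            record[name] = None          # same idea as null in Power Query
        elif name in NUMERIC:
            record[name] = int(value)
        else:
            record[name] = value
    return record


# The Snowden row described above.
snowden_line = "tt3774114\tmovie\tSnowden\tSnowden\t0\t2016\t\\N\t134\tBiography,Crime,Drama"
snowden = parse_title_row(snowden_line)
print(snowden["endYear"], snowden["runtimeMinutes"])  # None 134
```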
title.ratings.tsv.gz
This file has 3 columns: tconst, averageRating, and numVotes. It has just over 1.2 million rows.
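Because title.ratings shares the tconst key with title.basics, one simple way to combine them is to load the ratings into a dictionary keyed by tconst. This is only a sketch: load_ratings is a made-up name, and it assumes the file starts with a header row naming the three columns.

```python
import csv


def load_ratings(handle):
    """Map tconst -> (averageRating, numVotes) from an open ratings TSV.

    Assumption: the first line is a header with the three column names.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["tconst"]: (float(row["averageRating"]), int(row["numVotes"]))
        for row in reader
    }
```

You would pass it a file opened in text mode, for example with gzip.open("title.ratings.tsv.gz", "rt", encoding="utf-8"), and then look up a title with ratings.get("tt3774114"). A dictionary of this size should fit in memory on most computers, but that depends on your machine.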
name.basics.tsv.gz
This file is about the people. It has over 11 million rows and the following columns.
- nconst
- primaryName – the full name such as Fred Astaire
- birthYear – some may be “\N”
- deathYear – some may be “\N”
- primaryProfession – a csv list such as actress,soundtrack,producer
- knownForTitles – a csv list such as tt0050419,tt0053137,tt0072308,tt0031983
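The csv lists in primaryProfession and knownForTitles are easier to work with once they are split into separate items. Here is a minimal Python sketch; the helper name split_list_field is made up, and treating "\N" as an empty list is an assumption of this example.

```python
def split_list_field(value):
    """Split a comma-separated field into a list of items.

    Assumption: a missing value ("\\N" or an empty string) becomes [].
    """
    if value == r"\N" or value == "":
        return []
    return value.split(",")


print(split_list_field("actress,soundtrack,producer"))
# ['actress', 'soundtrack', 'producer']
print(split_list_field("tt0050419,tt0053137,tt0072308,tt0031983"))
# ['tt0050419', 'tt0053137', 'tt0072308', 'tt0031983']
```

The same helper should also work for the genres column in title.basics, which uses the same comma-separated style.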
title.crew.tsv.gz
This file contains the director and writer information for all the titles in IMDb. The columns are tconst, directors, and writers.
Understanding the Files
To better understand the files and data of the IMDb database, have a look at our series called Tiny IMDB.