IMDB Title Basics


This entry is part 4 of 9 in the series IMDB Data Files

The titlebasics file is a good file to start with after you have downloaded it. I re-named the downloaded file to titlebasics.tsv. I created a new Excel file. Now I need to use Power Query to import and transform the data.

Power Query – titlebasics

In Power Query, when we “get and transform” data, we need to choose to Transform the data by clicking on the Transform button. Here we need to do a bunch of things. Below is the M Language code for the titlebasics.tsv file. I changed the directory to suit how I store data on my computer. Your folder system will likely be different so adjust the second line below to suit how you work. Also note that in the script below, the data has been filtered to only import movies. This has been further filtered to only include years of 2000 foreward. I used the startYear column to do this.

let
    Source = Csv.Document(File.Contents("D:\IMDB_DataModel_Reduced\RawData\titlebasics.tsv"),[Delimiter="	", Columns=9, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"tconst", type text}, {"titleType", type text}, {"primaryTitle", type text}, {"originalTitle", type text}, {"isAdult", Int64.Type}, {"startYear", type text}, {"endYear", type text}, {"runtimeMinutes", type text}, {"genres", type text}}),
    #"Filtered for movie only" = Table.SelectRows(#"Changed Type", each [titleType] = "movie"),
    #"Removed Columns originalTitle and isAdult" = Table.RemoveColumns(#"Filtered for movie only",{"originalTitle", "isAdult"}),
    #"Replaced \N with null startYear" = Table.ReplaceValue(#"Removed Columns originalTitle and isAdult","\N",null,Replacer.ReplaceValue,{"startYear"}),
    #"Replaced blanks with null startYear" = Table.ReplaceValue(#"Replaced \N with null startYear","",null,Replacer.ReplaceValue,{"startYear"}),
    #"Changed startYear to Number" = Table.TransformColumnTypes(#"Replaced blanks with null startYear",{{"startYear", Int64.Type}}),
    #"Filtered for year 2000+" = Table.SelectRows(#"Changed startYear to Number", each [startYear] > 2010),
    #"Replaced \N with null endYear" = Table.ReplaceValue(#"Filtered for year 2010+","\N",null,Replacer.ReplaceValue,{"endYear"}),
    #"Replaced blank with null endYear" = Table.ReplaceValue(#"Replaced \N with null endYear","",null,Replacer.ReplaceValue,{"endYear"}),
    #"Changed endYear to number" = Table.TransformColumnTypes(#"Replaced blank with null endYear",{{"endYear", Int64.Type}}),
    #"Replaced \N with null runtimeMinutes" = Table.ReplaceValue(#"Changed endYear to number","\N",null,Replacer.ReplaceValue,{"runtimeMinutes"}),
    #"Replace blank with null runtimeMinutes" = Table.ReplaceValue(#"Replaced \N with null runtimeMinutes","",null,Replacer.ReplaceValue,{"runtimeMinutes"}),
    #"Replaced \N with null genres" = Table.ReplaceValue(#"Replace blank with null runtimeMinutes","\N",null,Replacer.ReplaceValue,{"genres"}),
    #"Replaced blank with null genres" = Table.ReplaceValue(#"Replaced \N with null genres","",null,Replacer.ReplaceValue,{"genres"}),
    #"Split genres into 3" = Table.SplitColumn(#"Replaced blank with null genres", "genres", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"genres.1", "genres.2", "genres.3"}),
    #"Replaced \N with null genre.1" = Table.ReplaceValue(#"Split genres into 3","\N",null,Replacer.ReplaceValue,{"genres.1"}),
    #"Replaced blanks with null genre.1" = Table.ReplaceValue(#"Replaced \N with null genre.1","",null,Replacer.ReplaceValue,{"genres.1"}),
    #"Replaced \N with null genres.2" = Table.ReplaceValue(#"Replaced blanks with null genre.1","\N",null,Replacer.ReplaceValue,{"genres.2"}),
    #"Replaced blanks with null genre.2" = Table.ReplaceValue(#"Replaced \N with null genres.2","",null,Replacer.ReplaceValue,{"genres.2"}),
    #"Replaced \N with null genre.3" = Table.ReplaceValue(#"Replaced blanks with null genre.2","\N",null,Replacer.ReplaceValue,{"genres.3"}),
    #"Replaced blanks with null genre.3" = Table.ReplaceValue(#"Replaced \N with null genre.3","",null,Replacer.ReplaceValue,{"genres.3"}),
    #"Changed all genre to text" = Table.TransformColumnTypes(#"Replaced blanks with null genre.3",{{"genres.1", type text}, {"genres.2", type text}, {"genres.3", type text}})
in
    #"Changed all genre to text"

The Genres of the Titles

Have a close look at the data in the titlebasics file. Each moving picture (title) can have several genres associated with it. Instead of having those genres listed in the title basics file, we want to “normalize” it and bring it out into its own table and then link the two tables together with a relationship. We need to create a Reference of the query. We are using the data in the titlebasic query and going a few steps further. We want to focus on the genre columns and unpivot them.

let
    Source = titlebasics,
    #"Removed Columns" = Table.RemoveColumns(Source,{"titleType", "primaryTitle", "startYear", "endYear", "runtimeMinutes"}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Columns", {"tconst"}, "Attribute", "Value"),
    #"Removed Columns1" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Attribute"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Value", "Genre"}})
in
    #"Renamed Columns"

In the menu system, go to Close and Load, Close and Load To…, and check the Data Model at the bottom of the pop-up. Also choose Only create connection.

Series Navigation<< IMDB and the Excel Data ModelIMDB Unique Genres Query >>

Leave a Reply