IMDB Repeating Genre Values


This entry is part 4 of 9 in the series IMDB Data Files

In working with the IMDB database tables, you find that each moving picture can have one or more genres that it belongs to. For example, it may be a Comedy and a Drama. In the database, we find a list of genres at the intersection of a row and a column. Each genre is separated by a comma, as you can see in the screenshot below. This screenshot is of the Power Query Editor’s Current View pane/window after some transformations have been done. This screenshot below is of the titlebasics query.

What we need to do is create another table and then link that new table with this titlebasics query. When we have that done, we can perform many queries such as “show a list of all the titles that are Comedies, sorted alphabetically”. The new table will have a column that identifies the moving picture. That’s the tconst column. The second column will contain a genre name. If a moving picture is part of two genres, the new table will have two rows where the identifier for the moving picture is listed twice in the first column. How do we create that table?

The first thing is to split that column into multiple columns. First, select the column. Go to the Transform tab. Choose Split Column.

Although not visible in the screenshot above, if a title is only a Drama and no other genre, the transformation will automatically, and fortunately, insert a null value in the other two genre columns. This becomes important in a moment when we unpivot because we only want one row for this title if it only has one genre. If it has three genres, then when we unpivot and turn those columns into rows, we want to get three rows. The primary key of the titlebasics table will become a foreign key in this new table.

The next thing is to create the second table based on this table. Go to the left side of the Power Query Editor under the Queries pane and select the query called titlebasics. Right-click and choose Reference.

A new query is created and given a name called titlebasics (2). Rename that query to titleGenres. Note that the Source of this new table is titlebasics. We want all of the transformations to happen, then with the data that’s left, we want to create our new Genres table.

We need to remove all of the columns except for the ones that we need. We need the tconst and the genre columns. We can just select them and press the Delete key, or use the Remove Columns button in the Home tab.

Now we need to turn these genre columns into rows, How do we do that? Unpivot. Go to the Transform tab. Select the tconst column. Go to the drop-down arrow of Unpivot Columns, Unpivot Other Columns. You will see something like the following.

We don’t need the middle column so we can delete it. We should rename the third column to Genre from Value.

M Language in Power Query

Here is the steps from the Advanced Editor of Power Query. Notice that the source is titlebasics. These steps are listed in the Applied Steps of the Query Settings pane shown on the right-side of the Power Query Editor when your query is selected in the Queries pane on the left-side.

let
    Source = titlebasics,
    #"Removed Columns" = Table.RemoveColumns(Source,{"titleType", "primaryTitle", "startYear", "endYear", "runtimeMinutes"}),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Columns", {"tconst"}, "Attribute", "Value"),
    #"Removed Columns1" = Table.RemoveColumns(#"Unpivoted Other Columns",{"Attribute"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Value", "Genre"}})
in
    #"Renamed Columns"

Duplicate or Reference?

Duplicate creates a second copy of your existing query that’s independent to the existing query. The copied version is completely independent to the original version. Whereas reference creates a new query that’s dependent on the existing query. Use Reference when you’d like the starting point of your new query to be the ending step of your existing query.

Excel Power Pivot

In Excel, go to the Power pivot tab in the Ribbon. Click Manage. You can also get there from the Data tab and the Data Tools group. In the Home tab click Diagram View. Create the relationship beteen our two tables.

As a housekeeping chore, we will hide the three genre columns so that users creating pivot tables will be sure use our new table when using the genres, and not use any of the leftover columns.

Summary of the Steps

Here are the steps involved in creating the Genres table.

  1. Go to the Power Query Editor
  2. In the Queries pane, right-click the titlebasics query and select Reference
  3. Rename the new query to titleGenres
  4. Do the transformation steps shown above in the listing and described in this post
  5. Select the Home tab of the ribbon and Close and Load
  6. Go to the Power Pivot window and create a relationship btween titlebasics and titleGenre.
Series Navigation<< IMDB and the Excel Data ModelIMDB Unique Genres Query >>

Leave a Reply