One-Hot Encoding of Categorical Variables


Are you a data analyst, or working on a data analysis project, and wondering how to deal with a categorical column in your dataset? Most machine learning algorithms cannot work directly with categorical variables (also known as categorical features), so we need to convert each categorical feature into numeric dummy variables, which we can do with pandas.

What is one-hot encoding and what are categorical variables? First of all, we need to understand what a categorical variable is. Categorical data is data that is divided into a limited number of qualitative groups. We can call them groups or labels. These labels may just be names. For example, North, South, East and West. These labels are “in name only” and are therefore called nominal items. They have no particular order.

Ordinal items, by contrast, have a prescribed order. For example, “small, medium and large”. Another example would be “first, second, third”. Another would be “first class, business class and economy class”.

Data Analytics and Machine Learning

One-hot encoding is a data transformation technique that turns one categorical variable into several binary variables, one for each category. A binary variable is either 0 or 1, and for any given row only the column matching that row's category is 1. You will find yourself using one-hot encoding for many variables in your dataset.
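As a minimal sketch of this idea, the snippet below encodes a single nominal column using the compass labels from earlier. The DataFrame and the column name region are made up for illustration; dtype=int is passed only so the output shows 0 and 1 rather than True and False.

```python
import pandas as pd

# Hypothetical toy data: one nominal column with four labels
df = pd.DataFrame({"region": ["North", "South", "East", "West", "North"]})

# Each distinct label becomes its own 0/1 column named region_<label>
encoded = pd.get_dummies(df, columns=["region"], dtype=int)
print(encoded)
```

Every row of the result has exactly one 1, in the column that matches its original label.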

Suppose you have a pandas DataFrame named df1 in which many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information. By default, get_dummies() only encodes columns that hold strings or categories, so to make these numeric columns recognizable as categorical variables, you’ll first need to convert them to strings (str).

  1. Define a variable called cols_to_str, which is a list of the numeric columns that contain categorical information and must be converted to string.
  2. Write a for loop that converts each column in cols_to_str to string.
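# Numeric columns that actually hold category labels
cols_to_str = ['CodeID', 'VendorID']
for col in cols_to_str:
    # Convert to string so get_dummies() treats the column as categorical
    df1[col] = df1[col].astype(str)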
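To see why the conversion matters, here is a small sketch on made-up data. The fare column and the values are assumptions for illustration only; VendorID is borrowed from the list above.

```python
import pandas as pd

# Hypothetical data: VendorID is stored as integers but is really a label
df1 = pd.DataFrame({"VendorID": [1, 2, 1], "fare": [9.5, 12.0, 7.25]})

# Without conversion, get_dummies() leaves the integer column as it is
untouched = pd.get_dummies(df1)
print(list(untouched.columns))

# After conversion to string, VendorID is expanded into dummy columns
df1["VendorID"] = df1["VendorID"].astype(str)
expanded = pd.get_dummies(df1)
print(list(expanded.columns))
```

The first call returns the same two columns it was given, while the second replaces VendorID with one dummy column per vendor and leaves fare alone.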

Once the columns in cols_to_str are strings, a single call to get_dummies() encodes every categorical column in df1. Here is some example code.

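# Assumes pandas has been imported as pd
# drop_first=True drops the first dummy column of each variable, since it is implied by the others
df2 = pd.get_dummies(df1, drop_first=True)
df2.info()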
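The effect of drop_first=True is easiest to see side by side. The sketch below uses a made-up color column; with three categories, the full encoding produces three columns and the reduced encoding produces two, because the dropped category is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical nominal column with three categories
colors = pd.DataFrame({"color": ["blue", "green", "red", "blue"]})

# Full encoding: one column per category
full = pd.get_dummies(colors, dtype=int)

# Reduced encoding: the first category (blue) is dropped
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column removes the redundancy among the dummies, which matters mostly for models such as linear regression that are sensitive to perfectly correlated inputs.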

Do You Use One-Hot Encoding on Ordinal Variables?

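No. If your categories have a natural order, do not use get_dummies() on them, because one-hot encoding treats every category as unrelated and throws the ordering away. Instead, convert the levels to numbers that preserve the order, such as small = 0, medium = 1 and large = 2.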
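One simple way to do this, sketched here under the assumption that you know the full set of levels in advance, is to map each level to an integer with a dictionary. The column name shirt_size and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal column
df = pd.DataFrame({"shirt_size": ["small", "large", "medium", "small"]})

# Explicit mapping from each level to its rank; the order is chosen by you
size_order = {"small": 0, "medium": 1, "large": 2}

# Any level missing from the mapping would become NaN, so list every level
df["size_code"] = df["shirt_size"].map(size_order)
print(df)
```

Writing the mapping out by hand keeps the ordering explicit, instead of relying on alphabetical order, which would put large before medium.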
