For this example, I will use Python. Are you a data professional working with a dataset that contains a column of categorical data such as low, medium and high? These aren’t just regular nominal categories such as North, South, East and West. The categories we are working with have an order, and thus we call them ordinal categories.
Are you creating a model, such as a regression model or a decision tree and are you working in Python? If so you will need to encode the low, medium and high column. This could represent three categories of sales for example. It could be years of experience, or just about anything.
You do not want to dummy encode the low, medium high column. There’s a better way. Let’s look at a simple example.
Let’s imagine the column represents salary. Suppose we are working in the human resources department and we are analyzing staff. We are working in the pandas library of Python. Our DataFrame is called df_enc, meaning dataset encoded. What would the pandas code look like?
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # manually create a dataframe from a dictionary of lists lstnum = [6, 6, 5, 5, 5, 6] data = {'firstname': ['Bob', 'Sally', 'Suzie', 'Rowan', 'Sandra', 'Shirley'], 'tenure': [2, 4, 5, 7, 7, 8], 'salary': ['low', 'low', 'medium', 'medium', 'medium', 'high'], 'random': lstnum} df = pd.DataFrame(data) df
# copy the dataset df_enc = df.copy() # Encode the salary column as an ordinal numeric category df_enc['salary'] = ( df_enc['salary'].astype('category') .cat.set_categories(['low', 'medium', 'high']) .cat.codes ) df_enc