Feature Engineering


Data professionals give careful consideration to all of the variables (columns) in their dataset. Machine learning professionals call those variables features. Feature engineering is a step that goes beyond exploratory data analysis (EDA).

Feature engineering is the process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data. There’s a lot to unpack in that definition. When does it happen? In Google’s PACE framework, feature engineering happens during the Analyze phase, right after EDA and just before managing class imbalances.

Features (variables, columns) can be continuous, discrete, or categorical. Continuous variables take values obtained by measurement and can take on an infinite, uncountable set of values. Categorical variables, on the other hand, contain a finite number of groups, categories, or countable numerical values (the countable numerical case is what “discrete” refers to).

The three general categories of feature engineering are feature selection, transformation, and extraction.

Feature Selection

The goal of feature selection is to select the features (variables) in the data that contribute the most to predicting your response variable. You drop features that do not help in making a prediction. Here’s how you drop a column in pandas and how you rename a column in a DataFrame.
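A minimal sketch, using a made-up DataFrame whose column names (temp_f, zip, noise) are placeholders:

import pandas as pd

# A small, made-up dataset with one feature (noise) that adds no predictive value
df = pd.DataFrame({
    "temp_f": [70, 85, 50],
    "zip": ["10001", "94105", "60601"],
    "noise": [1, 2, 3],
})

# Drop a feature that does not help the prediction
df = df.drop(columns=["noise"])

# Rename a column to something more descriptive
df = df.rename(columns={"temp_f": "temperature_f"})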

This process can be extensive and challenging, and it may even require multiple rounds of EDA and feature engineering. If you have a continuous feature, such as outside temperature, would you want to convert it to hot, mild, and cold? If you have a location zip code or postal code, would you want to convert it to levels of location quality such as excellent, good, fair, and poor?
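Here is one way the zip-code idea could look, assuming a hypothetical lookup table of location quality:

import pandas as pd

df = pd.DataFrame({"zip": ["10001", "94105", "60601", "99999"]})

# Hypothetical mapping from zip code to a quality-of-location level
zip_quality = {"10001": "excellent", "94105": "good", "60601": "fair"}

# Zip codes missing from the lookup fall back to "poor"
df["location_quality"] = df["zip"].map(zip_quality).fillna("poor")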

Feature Transformation

In feature transformation, data professionals take the raw data in the dataset and create features that are suitable for modeling. This is done by modifying the existing features in a way that improves accuracy when training the model. For example, you could define cut-off points for the temperature data and create a new categorical feature from the numerical data that is either hot, warm, or cold.
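A sketch of that binning, with cut-off points chosen purely for illustration (degrees Fahrenheit):

import pandas as pd

df = pd.DataFrame({"temperature_f": [35, 62, 68, 90, 74]})

# Below 50 is cold, 50-75 is warm, above 75 is hot (illustrative cut-offs)
df["temp_level"] = pd.cut(
    df["temperature_f"],
    bins=[-float("inf"), 50, 75, float("inf")],
    labels=["cold", "warm", "hot"],
)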

Feature Extraction

Feature extraction may involve combining multiple features to create a new one that improves the accuracy of the algorithm. For example, imagine we want to create a new variable called Muggy that could be used to model whether or not we play soccer. If the temperature is warm and the humidity is high, the variable Muggy would be true.
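A minimal sketch of that rule; the 80% humidity threshold is an assumption made for illustration:

import pandas as pd

df = pd.DataFrame({
    "temp_level": ["cold", "warm", "warm", "hot"],
    "humidity_pct": [40, 85, 30, 90],
})

# Muggy is true when it is warm and the humidity is high (threshold assumed)
df["muggy"] = (df["temp_level"] == "warm") & (df["humidity_pct"] >= 80)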

Feature extraction may also involve taking a single feature, such as a one-to-five rating from customers, and changing it to a binary variable: positive or negative, or perhaps good or bad. You might have survey results that ask people to rate things from 1 to 4 or from 1 to 5 (or some other scale), and you want to convert those numbers to either satisfied or not satisfied.
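For a 1-to-5 scale, one sketch is to treat ratings of 4 and above as satisfied; the cut-off is an assumed choice, not a rule:

import pandas as pd

df = pd.DataFrame({"rating": [1, 3, 4, 5, 2]})

# Ratings of 4 or 5 count as satisfied (assumed cut-off)
df["satisfied"] = (df["rating"] >= 4).map({True: "satisfied", False: "not satisfied"})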

Frequency

It’s also necessary to understand the frequency with which values of the variables occur. To recap, we want to understand what our variables are, how they are structured, and how frequently their values occur.

For classification problems, you specifically need to understand the frequencies of the response variable’s classes. As a data professional, you might encounter datasets that are unequal in terms of their response variables. Suppose you are trying to detect fraudulent transactions, or perhaps spam emails. You could have millions of examples of nonfraudulent transactions and only a few thousand examples of actual fraudulent transactions.
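Checking those frequencies is straightforward in pandas; the counts below are made up just to show the shape of the problem:

import pandas as pd

# Made-up response variable with a heavy class imbalance
y = pd.Series(["legit"] * 9950 + ["fraud"] * 50, name="transaction_type")

# Absolute counts and relative frequencies of each class
print(y.value_counts())
print(y.value_counts(normalize=True))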

Feature engineering is part of the Analyze stage of Google’s PACE framework.
