Model evaluation takes place in the fourth of the four stages of Google's PACE framework, the Execute stage.
The first model you produce will almost never be the one that gets deployed. Why? Because model building is an iterative process, and each iteration provides the information needed to get the model working optimally. After you tweak parameters or change how features are engineered, performance metrics provide a basis for comparing the new model both to other candidate models and to its own earlier versions.
Here are the four main evaluation metrics for supervised classification (categorical) machine learning models:
- Accuracy
- Precision
- Recall
- F1 Score
Accuracy
Accuracy is the proportion of data points that are correctly classified: the number of correct predictions divided by the total number of predictions. Accuracy is a good metric only when the data is fairly well balanced, meaning there are roughly the same numbers of positives and negatives.
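As a quick illustration, accuracy can be computed directly from the counts of true and false positives and negatives; the numbers below are made up purely for the example:

```python
# Hypothetical prediction counts, chosen only for illustration.
tp, tn, fp, fn = 90, 850, 30, 30  # true positives, true negatives, false positives, false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
print(f"Accuracy: {accuracy:.2f}")  # 0.94
```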
Precision
Precision measures what proportion of positive predictions were correct. It is calculated by dividing the number of true positives by the sum of true and false positives. In other words, if the model predicted that something would be present (a disease in a patient, malware on a computer, and so on), how often was it actually present?
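Using the same hypothetical counts as in the accuracy sketch above:

```python
# Same hypothetical counts as in the accuracy example.
tp, fp = 90, 30

precision = tp / (tp + fp)  # of all positive predictions, how many were actually positive
print(f"Precision: {precision:.2f}")  # 0.75
```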
Recall
Recall measures the proportion of actual positives that were identified correctly. It is calculated by dividing the number of true positives by the sum of true positives and false negatives. Precision is a good metric to use when it's important to avoid false positives. For example, if your model is designed to screen out ineligible loan applicants before a human review, it's best to err on the side of caution and not automatically disqualify people before a person can review the case more carefully. Recall is a good metric to use when it's important to identify as many actual positives as possible. For example, if your model identifies poisonous mushrooms, it's better to flag every true occurrence of a poisonous mushroom, even if that means making a few more false positive predictions.
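Again using the same hypothetical counts:

```python
# Same hypothetical counts as in the earlier sketches.
tp, fn = 90, 30

recall = tp / (tp + fn)  # of all actual positives, how many the model identified
print(f"Recall: {recall:.2f}")  # 0.75
```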
F1 Score
F1 score combines precision and recall into a single metric by taking their harmonic mean.
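Here is a minimal sketch of the harmonic-mean formula, followed by scikit-learn's built-in functions computing all four metrics on a made-up set of labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Harmonic mean of the hypothetical precision and recall from the earlier sketches.
precision, recall = 0.75, 0.75
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1:.2f}")  # 0.75

# scikit-learn computes the same metrics directly from true and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```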
Unbalanced Datasets
Precision, recall, and F1 score are especially useful for measuring performance on unbalanced classes. Accuracy is not.
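A small sketch of why: on a made-up dataset where only 5% of cases are positive, a model that always predicts the negative class still scores 95% accuracy while catching nothing:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up, heavily unbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a useless model that always predicts "negative"

print(accuracy_score(y_true, y_pred))                    # 0.95 -- looks impressive
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0  -- misses every positive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```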
Spam Detection Example
What does a "false positive" mean? Suppose you have a model that attempts to predict spam by looking for email characteristics that might indicate a message is spam. A false positive means that the model predicted positive (spam) but the message really was not spam. The second word is the model's prediction, and the first word tells you whether that prediction matched reality. A "false negative" would mean that the model predicted a message was not spam when in reality it was. Where is the risk? There is a cost to being too aggressive and putting too many good emails into the spam folder: you could easily miss important messages. Precision is therefore a good metric to use for spam detection.
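One way to keep the terminology straight is to lay out a confusion matrix. This sketch uses made-up labels where 1 means spam:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a handful of emails: 1 = spam, 0 = not spam.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # reality
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # model's prediction

# scikit-learn orders the flattened 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=4
```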
In this spam example, you may have built a binomial logistic regression model.
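Here is a minimal sketch of what such a model might look like in scikit-learn; the two numeric "email features" and the labeling rule below are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

# Hypothetical features per email: [number of links, count of trigger words].
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 10).astype(int)  # toy stand-in for "is spam"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()  # binomial (two-class) logistic regression
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision is the metric of interest for spam detection.
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.2f}")
```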
Fraud Detection
In this example, you are building a model that tries to detect fraudulent transactions. A "false negative" means predicting a transaction that is actually fraudulent as safe. We don't want fraudulent transactions to slip through, so we want to be aggressive in predicting fraud. A false positive isn't as risky: the model flagged a transaction as fraud when it really wasn't. In this case, recall is a good metric to use.
What kinds of business problems would be best addressed by supervised learning models? What requirements are needed to create effective supervised learning models?