Logistic Regression Introduction


This entry is part 1 of 3 in the series Logistic Regression

What is logistic regression? It belongs to the broad statistical family of regression methods. Other types of regression include simple linear regression and multiple linear regression.

If the variable we are trying to predict is categorical, we could build a logistic regression model or a tree-based machine learning model. Note that tree-based models can also handle continuous target variables.

Use Cases

Do you want to predict the probability of an event occurring? For example, will a customer buy from you again? What are the chances that a customer will churn (leave) an organization, such as a bank? What factors contribute to a user commenting on a website? Will a given player on the basketball team score more than 12 points in the next game? To answer that, we could look at the player’s average points per game from last season or their average playing time per game. Does a particular product feature result in happier customers (based on a customer survey)? What predicts the likelihood that an employee receives high performance ratings?

Can we predict whether an incoming email is spam based on its features and then divert it to the spam folder? Could we predict whether a person has a certain disease based on some health data? Can we predict whether a bank customer will default on a loan? These are all examples of binary classification.

What is Logistic Regression?

Logistic regression is a technique that models a categorical dependent variable Y based on one or more independent variables X. It is similar to multiple linear regression except that the outcome is categorical rather than continuous. The dependent variable can also be called the outcome or target variable. In binomial logistic regression the outcome takes exactly two discrete values; in multinomial logistic regression it can take more than two.

Binomial logistic regression models the probability of an observation falling into one of two categories based on one or more independent variables. We normally use the binary variable Y to indicate the category. Logistic regression is quite sensitive to outliers.
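To make this concrete, here is a minimal sketch of fitting a binomial logistic regression in Python with scikit-learn. The data and feature names (points and minutes, echoing the basketball example) are made up purely for illustration; they do not come from any real dataset.

    # A rough sketch: fit a binomial logistic regression on synthetic data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    # Hypothetical predictors: average points and minutes per game last season.
    X = rng.normal(loc=[10.0, 25.0], scale=[4.0, 6.0], size=(200, 2))

    # Hypothetical binary outcome: 1 if the player tops 12 points, else 0.
    y = (0.3 * X[:, 0] + 0.1 * X[:, 1] - 5 + rng.normal(size=200) > 0).astype(int)

    model = LogisticRegression()
    model.fit(X, y)

    # Predicted probability of the "1" category for the first five observations.
    print(model.predict_proba(X[:5])[:, 1])
    print("coefficients:", model.coef_, "intercept:", model.intercept_)

The fitted model outputs a probability between 0 and 1 for each observation; a cutoff (0.5 by default in scikit-learn’s predict method) turns that probability into a class label.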

Before looking at logistic regression, you’ll want to first study linear regression and then multiple linear regression. A bit of probability knowledge also helps.

What are the binomial logistic regression model assumptions? The linearity assumption is the first and most important: there should be a linear relationship between each independent variable and the logit of the probability that Y equals 1. Understanding this requires a short technical discussion of the logit function. The second assumption is that the observations are independent of each other. We also assume that there is little or no multicollinearity between the independent variables (a check for this is sketched after the list below). The fourth assumption is that there are no extreme outliers. Outliers can be detected after the model is fit and can be either transformed or removed, depending on the situation.
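Roughly, the linearity assumption can be written in standard notation (not taken from this post) as:

    logit(p) = ln( p / (1 − p) ) = β0 + β1·X1 + β2·X2 + … + βk·Xk,   where p = P(Y = 1)

In words, each one-unit increase in an X variable shifts the log-odds of the outcome by the corresponding coefficient, not the probability itself.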

  • Outcome variable is categorical
  • Observations are independent of each other
  • No severe multicollinearity among X variables
  • No extreme outliers
  • Linear relationship between each X variable and the logit of the outcome variable
  • Sufficiently large sample size
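A couple of these assumptions can be checked directly. For instance, here is a rough sketch of screening for multicollinearity with variance inflation factors (VIF) using statsmodels; the DataFrame and column names are hypothetical.

    # A rough sketch: check multicollinearity among predictors with VIF.
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Hypothetical predictor data.
    df = pd.DataFrame({
        "points_last_season": [8, 12, 15, 7, 20, 11, 9, 14],
        "minutes_per_game":   [20, 28, 33, 18, 36, 25, 22, 30],
    })

    X = add_constant(df)  # VIF is usually computed with an intercept column included
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)  # a common rule of thumb: VIF above roughly 5-10 signals a problem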

Use Cases for Logistic Regression

Logistic regression is a classifier. What can we use it for?

  • Will an email be Spam or “Ham” (not spam)?
  • Will someone default on their loan or mortgage?
  • Will a bank customer churn (leave) to another bank?
  • Will a starting basketball player score more than 12 points in his next game?
  • Will a visitor to our website write a comment or not?

When working with linear and logistic regression models, what type of evaluation metrics will we be using? For linear regression we’ll use metrics such as R squared and mean squared error; for logistic regression, which is a classifier, we’ll use metrics such as area under the ROC curve, precision, and recall to evaluate the effectiveness of the model.
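As a rough sketch (using scikit-learn and a synthetic dataset, purely for illustration), the classification metrics might be computed like this:

    # A rough sketch: evaluate a logistic regression classifier on held-out data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
    pred = clf.predict(X_test)               # thresholded at 0.5 by default

    print("AUC:      ", roc_auc_score(y_test, proba))
    print("Precision:", precision_score(y_test, pred))
    print("Recall:   ", recall_score(y_test, pred))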

Series Navigation: Binomial Logistic Regression >>
