Logistic Regression in Python


This entry is part 3 of 3 in the series Logistic Regression

This Python project example was done using Jupyter Notebook in Anaconda Navigator with made-up fake made-up data on the incidence of arthritis in patients. I created a project called Logistic Regression Arthritis Fake Data. I did this project on my local machine. This project example is here to help us understand what the algorithm might be for your own project.

This project is based on projects in the course at Coursera called Google Advanced Data Analytics, Regression Analysis: Simplify Complex Data Relationships. Since we are now actually constructing the model, we are in the Construct phase of Google’s PACE Framework.

Made-up Fake Arthritis Data

Below is the Python project. I will move quickly through it by combining code cells together. In a real project, you would split some of these cells and include more explanation. I also have not included all of the explanatory notes that I could have. The notes are in markdown cells. The data is fake and very contrived. You can see that the ages are just a series of numbers in order starting at age 20 and going up by one year each observation.

# import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as metrics # Import the metrics module from scikit-learn
# Load in if csv file is in the same folder as notebook
arth = pd.read_csv("LogRegressArthritis.csv")
df = arth.copy()
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
df.head(25).T   # dot T transposes

Download the LogRegressArthritis.csv I used. Download it if you wish. In our data, a value of zero in the Arthritis column means that the person does not have arthritis, and one means they do have arthritis.

# Load in sci-kit learn functions for constructing logistic regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Save X and y data into variables
X = df[["Age"]]
y = df[["Arthritis"]]

# Split dataset into training and holdout datasets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=4)
clf = LogisticRegression().fit(X_train,y_train)
# Print the coefficient
clf.coef_

# Print the coefficient
clf.coef_
array([[0.03329781]])

# Print the intercept
clf.intercept_
array([-3.1602916])

# Plot the logistic regression and its confidence band
fig, ax = plt.subplots()
fig.set_size_inches(5, 3)
sns.regplot(x="Age", y="Arthritis", data=df, logistic=True, ax=ax)

Click to Enlarge

Confusion Matrix

A confusion matrix is a graphical representation of how accurate a classifier is at predicting the labels for a categorical variable.

# Split data into training and holdout samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Build regression model
clf = LogisticRegression().fit(X_train,y_train)
# Save predictions
y_pred = clf.predict(X_test)

In order to understand and interpret the numbers in the below confusion matrix, it is important to keep the following in mind:

  • The upper-left quadrant displays the number of true negatives.
  • The bottom-left quadrant displays the number of false negatives.
  • The upper-right quadrant displays the number of false positives.
  • The bottom-right quadrant displays the number of true positives.

We can define the above bolded terms as follows in our given context:

  • True negatives: The number of people that did not have arthritis that the model accurately predicted did not have arthritis.
  • False negatives: The number of people that did not have arthritis that the model inaccurately predicted did not have arthritis.
  • False positives: The number of people that had arthritis that the model inaccurately predicted not having arthritis.
  • True positives: The number of people that had arthritis that the model accurately predicted as having arthritis.

A perfect model would yield all true negatives and true positives, and no false negatives or false positives.

# Calculate the values for each quadrant in the confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred, labels = clf.classes_)
# Create the confusion matrix as a visualization
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels = clf.classes_)
# Display the confusion matrix
fig, ax = plt.subplots(figsize=(4,4))
disp.plot(ax=ax)

Click to Enlarge



Series Navigation<< Binomial Logistic Regression

Leave a Reply