Multiple Linear Regression for Penguins


This entry is part 3 of 2 in the series Multiple Linear Regression

In this post I will use Python to perform a multiple linear regression (MLR) analysis. The data will come from Palmer Penguins. I’m using Anaconda Navigator to load Jupyter Notebooks to do this locally on a laptop. I created a project called “MLR Penguins 2024 Jan”.

This example is a quick look at the Python code, with minimal explanation of the underlying theory, so that the code and the overall procedure are easy to follow. The main part of the program sets the X and Y variables, splits the data into training and holdout sets, sets up the OLS formula, and fits the model. Of course, before all of that we do some planning, import the data, explore the data, and clean the data. At the end of the program we interpret the results, which we would then communicate to the stakeholders.

# Import packages
import pandas as pd
import seaborn as sns
# Load dataset
penguins = sns.load_dataset("penguins")
# Examine first 5 rows of dataset
penguins.head()


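Next, subset the data to the columns we need, since we won’t use every column in the Palmer Penguins dataset. The code below also renames the sex column to gender, drops rows with missing values, and resets the index.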

# Subset data to needed columns
penguins = penguins[["body_mass_g", "bill_length_mm", "sex", "species"]]
# Rename the "sex" column to "gender"
penguins = penguins.rename(columns={"sex": "gender"})
# Drop rows with missing values
penguins = penguins.dropna()
# Reset index
penguins = penguins.reset_index(drop=True)

# Examine first 5 rows of data
penguins.head()
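
Before dropping rows, it can be useful to see how many values are missing in each column and how many rows the cleaning step removes. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are hypothetical stand-ins for illustration, not measurements from Palmer Penguins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0],
    "bill_length_mm": [39.1, 39.5, np.nan, 46.1],
    "gender": ["Male", "Female", None, "Female"],
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()

# Drop incomplete rows and renumber the index from 0
toy_clean = toy.dropna().reset_index(drop=True)
rows_dropped = len(toy) - len(toy_clean)

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

The same two calls, `isna().sum()` and a before/after `len()` comparison, should work unchanged on the real `penguins` DataFrame.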

# Subset X and y variables
penguins_X = penguins[["bill_length_mm", "gender", "species"]]
penguins_y = penguins[["body_mass_g"]]
# Import train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split
# Create training data sets and holdout (testing) data sets
X_train, X_test, y_train, y_test = train_test_split(penguins_X, penguins_y, test_size=0.3, random_state=42)
# Write out OLS formula as a string
ols_formula = "body_mass_g ~ bill_length_mm + C(gender) + C(species)"
# Import ols() function from statsmodels package
from statsmodels.formula.api import ols

# Create OLS dataframe
ols_data = pd.concat([X_train, y_train], axis = 1)
# Create OLS object and fit the model
OLS = ols(formula = ols_formula, data = ols_data)
model = OLS.fit()
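
The holdout sets created above are not used again in this post. One way they could be used is to check how well the fitted model predicts rows it has not seen. The sketch below illustrates the idea on synthetic data; the DataFrame `demo`, its coefficients, and the noise level are all assumptions made up for this example, not values from the penguins analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

# Hypothetical synthetic data (not the Palmer Penguins measurements)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "bill_length_mm": rng.uniform(35, 55, n),
    "gender": rng.choice(["Female", "Male"], n),
})
demo["body_mass_g"] = (
    2000
    + 40 * demo["bill_length_mm"]
    + 500 * (demo["gender"] == "Male")
    + rng.normal(0, 100, n)
)

# Split into training and holdout rows, then fit on the training rows only
train, test = train_test_split(demo, test_size=0.3, random_state=42)
demo_model = ols("body_mass_g ~ bill_length_mm + C(gender)", data=train).fit()

# Predict body mass for the holdout rows
predicted = demo_model.predict(test)

# Holdout R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = ((test["body_mass_g"] - predicted) ** 2).sum()
ss_tot = ((test["body_mass_g"] - test["body_mass_g"].mean()) ** 2).sum()
holdout_r2 = 1 - ss_res / ss_tot

print("Holdout R-squared:", round(holdout_r2, 3))
```

For the penguins model, the analogous call would be `model.predict(X_test)`, with the result compared against `y_test["body_mass_g"]`.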

Get the model summary. Now we need to interpret the results.

# Get model results
model.summary()

Here is a screenshot of the results.


Now we can interpret and evaluate the model. In the upper part of the table, we get several summary statistics. We’ll focus on R-squared, which tells us how much of the variation in body mass (g) is explained by the model. Note that the summary reports R-squared for the training data the model was fit on. An R-squared of 0.85 is reasonably high: it means that 85% of the variation in body mass (g) is explained by the model.

In the lower half of the table, we get the beta coefficients estimated by the model, along with their 95% confidence intervals and p-values. Based on the p-value column, labeled P>|t|, we can tell that all of the X variables are statistically significant at the 0.05 level, since the p-value is less than 0.05 for every X variable.
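
The numbers in the summary table can also be read directly from the fitted results object instead of from the printed output. The sketch below uses the `rsquared`, `pvalues`, and `conf_int()` members of a statsmodels results object; the synthetic DataFrame `demo` is a hypothetical stand-in, so the printed values will not match the penguins results.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical synthetic data (not the Palmer Penguins measurements)
rng = np.random.default_rng(1)
n = 120
demo = pd.DataFrame({"bill_length_mm": rng.uniform(35, 55, n)})
demo["body_mass_g"] = 2000 + 40 * demo["bill_length_mm"] + rng.normal(0, 150, n)

results = ols("body_mass_g ~ bill_length_mm", data=demo).fit()

# Share of variance explained on the data the model was fit on
r_squared = results.rsquared
# One p-value per coefficient (the P>|t| column of the summary)
p_values = results.pvalues
# 95% confidence intervals (lower and upper bound per coefficient)
conf_intervals = results.conf_int(alpha=0.05)

# Names of the coefficients that are significant at the 0.05 level
significant = p_values[p_values < 0.05].index.tolist()

print("R-squared:", round(r_squared, 3))
print("Significant terms:", significant)
print(conf_intervals)
```

With the penguins model, the same attributes would be available as `model.rsquared`, `model.pvalues`, and `model.conf_int()`.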

C(gender) – Male. Given the name of the variable, we know that the variable was encoded as Male = 1, Female = 0. This means that female penguins are the reference point. If all other variables are held constant, we would expect a male penguin’s body mass to be about 528.95 grams more than a female penguin’s body mass.

C(species) – Chinstrap and Gentoo. Given the names of these two variables, we know that Adelie penguins are the reference point. So, if we compare an Adelie penguin and a Chinstrap penguin that have the same characteristics except their species, we would expect the Chinstrap penguin’s body mass to be about 285.39 grams less than the Adelie penguin’s. If we compare an Adelie penguin and a Gentoo penguin that have the same characteristics except their species, we would expect the Gentoo penguin’s body mass to be about 1,081.62 grams more than the Adelie penguin’s.
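
To make the idea of a reference point concrete, the sketch below builds 0/1 indicator columns by hand with `pd.get_dummies`. This is only an illustration of the encoding: the formula interface does its own encoding internally, and the small `species` Series here is a made-up example. Dropping the first level in sorted order leaves Adelie as the baseline, which is consistent with the reference points described above.

```python
import pandas as pd

# Hypothetical example values, one per penguin
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="species")

# One indicator column per species, dropping the first level (Adelie)
# so that it becomes the reference category
dummies = pd.get_dummies(species, drop_first=True, dtype=int)

print(dummies)
```

An Adelie penguin has 0 in both columns, so the Chinstrap and Gentoo coefficients each measure a difference relative to Adelie.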

Bill length (mm). Lastly, bill length (mm) is a continuous variable, so if we compare two penguins that have the same characteristics, except that one penguin’s bill is 1 millimeter longer, we would expect the penguin with the longer bill to have about 35.55 grams more body mass than the penguin with the shorter bill.
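
The coefficients can be combined to compare two penguins that differ in more than one characteristic. The sketch below uses the four coefficients quoted above; the intercept is not quoted in this post, so only differences in expected body mass are computed, not absolute predictions. The helper `expected_mass_difference` and the two example penguins are hypothetical names introduced for illustration.

```python
# Coefficients as quoted in the interpretation above, in grams
coefficients = {
    "male": 528.95,
    "chinstrap": -285.39,
    "gentoo": 1081.62,
    "bill_length_mm": 35.55,
}


def expected_mass_difference(a, b):
    """Expected body mass of penguin a minus penguin b, in grams."""

    def linear_part(p):
        # The intercept is omitted because it cancels in the difference
        return (
            coefficients["male"] * (p["gender"] == "Male")
            + coefficients["chinstrap"] * (p["species"] == "Chinstrap")
            + coefficients["gentoo"] * (p["species"] == "Gentoo")
            + coefficients["bill_length_mm"] * p["bill_length_mm"]
        )

    return linear_part(a) - linear_part(b)


# Hypothetical penguins: they differ in gender, species, and bill length
male_gentoo = {"gender": "Male", "species": "Gentoo", "bill_length_mm": 47.0}
female_adelie = {"gender": "Female", "species": "Adelie", "bill_length_mm": 45.0}

difference = expected_mass_difference(male_gentoo, female_adelie)
print(round(difference, 2))
```

Here the difference is 528.95 + 1,081.62 + 2 × 35.55, or about 1,681.67 grams.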
