- Multiple Linear Regression Introduction
- Multiple Linear Regression for Penguins
In this post I will use Python to perform a multiple linear regression (MLR) analysis on the Palmer Penguins dataset. I’m using Anaconda Navigator to launch Jupyter Notebook and run everything locally on a laptop. I created a project called “MLR Penguins 2024 Jan”.
This example is a quick look at the Python code, with minimal explanation of the underlying theory, so that the code and the algorithm are easy to follow. The main part of the program selects the X and y variables, splits the data into training and holdout sets, writes the OLS formula, and fits the model. Of course, before all of that we do some planning, import the data, explore it, and clean it. At the end of the program we interpret the results, which we would then communicate to the stakeholders.
# Import packages
import pandas as pd
import seaborn as sns

# Load dataset
penguins = sns.load_dataset("penguins")

# Examine first 5 rows of dataset
penguins.head()
Subset data to needed columns. We don’t need to work with all of the columns in the Palmer Penguins dataset.
# Subset data to needed columns
penguins = penguins[["body_mass_g", "bill_length_mm", "sex", "species"]]

# Rename columns
penguins.columns = ["body_mass_g", "bill_length_mm", "gender", "species"]

# Drop rows with missing values
penguins.dropna(inplace=True)

# Reset index
penguins.reset_index(inplace=True, drop=True)
# Examine first 5 rows of data
penguins.head()
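Before dropping rows it can help to see how many are affected. The sketch below is only an illustration on a tiny made-up data frame (the name `toy` and its values are mine, not rows from the Palmer Penguins data); the same two calls should work on `penguins`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps (NOT real penguin rows)
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 5000.0, 4200.0],
    "gender": ["Male", "Female", None, "Female"],
})

# Count missing values per column before cleaning
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna().reset_index(drop=True)
print(len(toy) - len(cleaned), "rows dropped")
```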
# Subset X and y variables
penguins_X = penguins[["bill_length_mm", "gender", "species"]]
penguins_y = penguins[["body_mass_g"]]

# Import train_test_split function from scikit-learn
from sklearn.model_selection import train_test_split

# Create training data sets and holdout (testing) data sets
X_train, X_test, y_train, y_test = train_test_split(
    penguins_X, penguins_y, test_size=0.3, random_state=42
)

# Write out OLS formula as a string
ols_formula = "body_mass_g ~ bill_length_mm + C(gender) + C(species)"

# Import ols() function from statsmodels package
from statsmodels.formula.api import ols
# Create OLS dataframe
ols_data = pd.concat([X_train, y_train], axis=1)

# Create OLS object and fit the model
OLS = ols(formula=ols_formula, data=ols_data)
model = OLS.fit()
Get the model summary. Now we need to interpret the results.
# Get model results
model.summary()
Here is a screenshot of the results.
Now we can interpret and evaluate the model. In the upper part of the table, we get several summary statistics. We’ll focus on R-squared, which tells us how much of the variation in body mass (g) is explained by the model. An R-squared of 0.85 is reasonably high: 85% of the variation in body mass in the training data is explained by bill length, gender, and species together. In the lower half of the table, we get the beta coefficients estimated by the model, along with their 95% confidence intervals and p-values. Based on the p-value column, labeled P>|t|, all of the X variables are statistically significant, since the p-value is less than 0.05 for each of them.
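The numbers read off the summary table can also be pulled from the fitted results object, and the holdout set created earlier gives a check on whether the training R-squared carries over to data the model has not seen. The sketch below is a minimal illustration under some assumptions: it uses a small synthetic data frame (made-up values, not the Palmer Penguins data) so that it runs without downloading anything, it leaves species out to stay short, and the helper name `r_squared` is my own. With the real data you would use `model`, `X_test`, and `y_test` from above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

# Synthetic stand-in data (made-up values, NOT the Palmer Penguins data)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, n),
    "gender": rng.choice(["Female", "Male"], n),
})
df["body_mass_g"] = (
    2000
    + 40 * df["bill_length_mm"]
    + 500 * (df["gender"] == "Male")
    + rng.normal(0, 100, n)
)

# Same 70/30 split and formula style as in the post
train, test = train_test_split(df, test_size=0.3, random_state=42)
model = ols("body_mass_g ~ bill_length_mm + C(gender)", data=train).fit()

# The same numbers shown in model.summary(), as attributes
print(model.rsquared)    # R-squared on the training data
print(model.params)      # estimated coefficients
print(model.pvalues)     # p-value for each coefficient
print(model.conf_int())  # 95% confidence intervals by default

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_residual / SS_total."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# R-squared on the holdout rows the model never saw
holdout_r2 = r_squared(test["body_mass_g"], model.predict(test))
print(holdout_r2)
```

If the holdout value is much lower than the training value, that is a hint the model may be overfitting the training rows.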
C(gender) – Male. The name of the variable tells us that gender was encoded as Male = 1, Female = 0, which makes female penguins the reference point. If all other variables are held constant, we would expect a male penguin’s body mass to be about 528.95 grams more than a female penguin’s body mass.
C(species) – Chinstrap and Gentoo. The names of these two variables tell us that Adelie penguins are the reference point. So, if we compare an Adelie penguin and a Chinstrap penguin that have the same characteristics except for their species, we would expect the Chinstrap penguin’s body mass to be about 285.39 grams less than the Adelie penguin’s. If we compare an Adelie penguin and a Gentoo penguin that have the same characteristics except for their species, we would expect the Gentoo penguin’s body mass to be about 1,081.62 grams more than the Adelie penguin’s.
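The reference-point idea for gender and species can be seen by building the indicator columns by hand. The sketch below is only an illustration: it uses pandas’ `get_dummies` with `drop_first=True` on a tiny made-up frame (the name `toy` is mine), which drops the alphabetically first category of each column, the same categories that end up as the baseline in the model above. It is not the code statsmodels runs internally.

```python
import pandas as pd

# Tiny made-up frame, just to show the encoding (NOT real penguin rows)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "species": ["Adelie", "Chinstrap", "Gentoo", "Gentoo"],
})

# drop_first=True removes the first category of each column,
# which then acts as the reference level
dummies = pd.get_dummies(
    toy, columns=["gender", "species"], drop_first=True, dtype=int
)
print(dummies)

# Row 0 (a female Adelie) is all zeros: it is the baseline that the
# Male, Chinstrap and Gentoo coefficients are measured against
```

There is no column for Female or Adelie, which is why no coefficient for them appears in the summary table.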
Bill length (mm). Lastly, bill length (mm) is a continuous variable, so if we compare two penguins who have the same characteristics, except one penguin’s bill is 1 millimeter longer, we would expect the penguin with the longer bill to have 35.55 grams more body mass than the penguin with the shorter bill.
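Because the model is additive, these coefficients can be combined to compare any two penguins. The sketch below is plain arithmetic on the values quoted above; the function names `linear_part` and `expected_difference` and the two example penguins are my own, and the intercept is left out because it cancels when we take a difference.

```python
# Coefficients quoted in the interpretation above (grams)
COEF = {
    "bill_length_mm": 35.55,
    "Male": 528.95,
    "Chinstrap": -285.39,
    "Gentoo": 1081.62,
}

def linear_part(bill_length_mm, gender, species):
    """Model prediction minus the intercept (Female and Adelie are the baselines)."""
    return (
        COEF["bill_length_mm"] * bill_length_mm
        + COEF.get(gender, 0.0)   # 0 for the reference level, Female
        + COEF.get(species, 0.0)  # 0 for the reference level, Adelie
    )

def expected_difference(a, b):
    """Expected body mass of penguin a minus penguin b, in grams."""
    return linear_part(**a) - linear_part(**b)

# Two hypothetical penguins chosen for illustration
male_gentoo = {"bill_length_mm": 47.0, "gender": "Male", "species": "Gentoo"}
female_adelie = {"bill_length_mm": 45.0, "gender": "Female", "species": "Adelie"}

diff = expected_difference(male_gentoo, female_adelie)
print(round(diff, 2))
```

Here the difference works out to 2 × 35.55 + 528.95 + 1,081.62 = 1,681.67 grams: the male Gentoo with the bill 2 mm longer is expected to be that much heavier than the female Adelie.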