Regression Analysis



Regression analysis, or regression modeling, is a group of statistical techniques that use existing data to estimate the relationship between a single dependent variable and one or more independent variables.

In statistics, regression analysis attempts to explain the influence that a set of independent variables has on the outcome of another variable of interest. It helps show whether, and by how much, those variables affect the outcome. The outcome variable is called the dependent variable because its value depends on the other variables. Those other variables are called the input variables or the independent variables.

Regression analysis is used when we want to model a relationship between two or more variables. Linear regression is a linear approximation of that relationship. We start with a simple linear regression model, which has one independent variable, and end with the multiple regression model, which has several. We also want to look at logistic regression.
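As a rough illustration of simple linear regression, the sketch below fits a line of the form y = intercept + slope * x by ordinary least squares. The function name fit_simple_linear and the data points are made up for this example; they are not taken from any particular dataset or library.

```python
def fit_simple_linear(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data chosen to lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_simple_linear(xs, ys)
print(f"slope={slope}, intercept={intercept}")
```

Real data will not sit exactly on a line, so the fitted slope and intercept are estimates of the underlying relationship rather than exact values.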

Use Cases

What is a person’s expected income based on their age and level of education? Linear regression is a tool that can answer this question. What if the question involves a probability? For example, what is the chance that a person will default on their loan? Logistic regression is a method we can use to answer this question. Other regression questions include how the price of a diamond varies, or what the body mass of a penguin is likely to be.
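To make the loan default question concrete, here is a minimal sketch of logistic regression with a single input variable, fitted by plain gradient descent. The input (a debt-to-income ratio), the data values, the function names and the training settings are all assumptions chosen for illustration. Gradient descent is only one of several ways to fit such a model.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical data: debt-to-income ratio and whether the loan defaulted (1) or not (0).
ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(ratios, defaulted)
p_low = sigmoid(w * 0.1 + b)   # estimated default probability at a low ratio
p_high = sigmoid(w * 0.8 + b)  # estimated default probability at a high ratio
print(f"P(default | 0.1) = {p_low:.2f}, P(default | 0.8) = {p_high:.2f}")
```

Unlike linear regression, the output is always between 0 and 1, which is why it can be read as a probability.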

Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome.

Suppose we have an association between two variables. Regression quantifies the nature of the relationship, whereas correlation measures the strength of the association.

How do we identify a correlation between two variables? We can plot them and look for a pattern, but a more precise approach is to compute the correlation coefficient. The correlation coefficient takes values from minus one to plus one, where zero indicates no linear relationship, plus one a perfect positive linear relationship, and minus one a perfect negative linear relationship. A positive correlation means that when one variable goes up, the other tends to go up, and a negative correlation means that when one goes up, the other tends to go down.
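The correlation coefficient described above is usually the Pearson coefficient. A minimal sketch of how it can be computed follows; the function name pearson_r and the sample values are invented for this example.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: this divides by zero if either variable is constant.
    return co_variation / (spread_x * spread_y)


# Illustrative values: one series rises with x, the other falls.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson_r(x, rising)     # close to +1
r_down = pearson_r(x, falling)  # close to -1
print(r_up, r_down)
```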

It’s easy to find correlations but much harder to find causation. Remember that correlation does not imply causation.

The Process

  1. Get the appropriate sample data
  2. Inspect and clean the data
  3. Design a model that works for that data
  4. Make predictions for the whole population
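The four steps above can be sketched end to end on a tiny example. Everything here is hypothetical: the records, the field names age and income, and the choice of a straight-line model are assumptions made only to show the flow from sample data to prediction.

```python
# 1. Get sample data (hypothetical records; income in thousands).
sample = [
    {"age": 25, "income": 30},
    {"age": 30, "income": None},   # missing outcome
    {"age": 35, "income": 40},
    {"age": 45, "income": 50},
    {"age": None, "income": 45},   # missing input
    {"age": 55, "income": 60},
]

# 2. Inspect and clean: here we simply drop rows with missing values.
clean = [r for r in sample if r["age"] is not None and r["income"] is not None]
ages = [r["age"] for r in clean]
incomes = [r["income"] for r in clean]

# 3. Design a model: a straight line fitted by ordinary least squares.
mean_age = sum(ages) / len(ages)
mean_income = sum(incomes) / len(incomes)
slope = (
    sum((a - mean_age) * (i - mean_income) for a, i in zip(ages, incomes))
    / sum((a - mean_age) ** 2 for a in ages)
)
intercept = mean_income - slope * mean_age


# 4. Make predictions for cases outside the sample.
def predict_income(age):
    return intercept + slope * age


predicted = predict_income(40)
print(f"Predicted income at age 40: {predicted:.1f}")
```

In practice the cleaning step usually involves more than dropping rows, and the model should be checked against data it was not fitted on before it is trusted for predictions.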

Suppose that you are working in Excel with two variables. You create a scatter plot to inspect the data and want to add a trendline to it. Excel offers the following trendline types: exponential, linear, logarithmic, polynomial, power and moving average. Many data analysts start with the simplest one: linear. What do they try next?
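One way to explore that question outside Excel is to fit two trendline shapes to the same data and compare how well each one fits. The sketch below compares a linear trend with an exponential trend, fitting the exponential by regressing ln(y) on x. The data and helper names are invented for illustration, and this is a common textbook approach rather than a description of what Excel does internally.

```python
import math


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """Share of the variation in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Illustrative data that roughly doubles at each step.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 7.9, 16.2, 31.8, 64.5]

# Linear trend: y = c + m * x
m, c = least_squares(xs, ys)
linear_pred = [c + m * x for x in xs]

# Exponential trend: y = a * exp(b * x), fitted as ln(y) = ln(a) + b * x
b, ln_a = least_squares(xs, [math.log(y) for y in ys])
exponential_pred = [math.exp(ln_a + b * x) for x in xs]

linear_r2 = r_squared(ys, linear_pred)
exponential_r2 = r_squared(ys, exponential_pred)
print(f"linear R^2 = {linear_r2:.3f}, exponential R^2 = {exponential_r2:.3f}")
```

For data that grows by a roughly constant factor, the exponential trend should fit noticeably better than the straight line. A higher R-squared alone does not prove a model is right, so it helps to look at the plot as well.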
