Series of Posts
This post is the first part of a series of posts called Statistics Confidence Intervals. The next series is called hypothesis testing. The previous series of posts was inferential statistics.
A confidence interval is the range within which you expect the population parameter to be. Its estimation is based on the data we have in our sample.
CI Calculation
When we get into calculating the confidence interval later on, there are two main situations.
- The population variance is known (z)
- The population variance is not known (t)
Before discussing confidence intervals, we’ll look at samples and point estimates.
Typically it is too expensive or time-consuming to measure an entire population. Instead, you take a representative sample and measure that. Suppose you want to know the mean of a population. A point estimate, such as the sample mean, can provide a general idea of a population parameter, but estimates usually include some error due to sampling variability. Data professionals use confidence intervals to describe the uncertainty of an estimate.
Estimator
What is an estimator of a population parameter? It is an approximation depending solely on sample information. A specific value is called an estimate. There are two types of estimates – point estimates and confidence interval estimates. A point estimate is a single number, while a confidence interval naturally is an interval (or range). For the intervals covered here, the point estimate is located exactly in the middle of the confidence interval. Confidence intervals provide much more information than a single number and are preferred when making inferences, because they describe the uncertainty surrounding an estimate.
The sample mean, x bar (x̄), is a point estimate of the population mean mu (μ). The sample variance S squared (s²) is an estimate of the population variance, sigma squared (σ²). A point estimate is a type of statistic.
Point estimates on their own are not very reliable. They can be way off the mark. The sample could be very biased or way too small, which would result in a very inaccurate point estimate. To the rescue are confidence intervals.
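As a small illustration of point estimates, the sketch below computes the sample mean and sample variance with Python's standard library. The price values are made up for the example and are not from any real dataset.

```python
import statistics

# Hypothetical sample of five prices (illustrative values only)
sample = [21.0, 24.5, 23.0, 25.5, 23.5]

x_bar = statistics.mean(sample)          # point estimate of the population mean
s_squared = statistics.variance(sample)  # sample variance (divides by n - 1)

print(f"sample mean = {x_bar:.2f}")
print(f"sample variance = {s_squared:.3f}")
```

Note that `statistics.variance` uses the n - 1 denominator, which is the usual choice when estimating a population variance from a sample.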
Univariate
Here we are only concerned with one variable. In terms of a dataset, we are looking at the values in a single column. We’ve got a sample and we need to estimate what the mean value would be for the population.
Confidence Intervals
As an example, suppose we went out and took a sample of the price of a widget in New York City. Suppose we calculated that the mean price was $23.50. Our point estimate is $23.50.
Suppose we want to be 95% confident that the mean cost of a widget lies within a certain range around the sample mean. An interval describes reality (the population) better than a single number, because it also conveys how uncertain the estimate is.
Confidence Level
The level of confidence is denoted by: 1 minus alpha (α) and is called the confidence level of the interval. Alpha is between zero and one. 0 <= α <= 1.
A confidence interval is the range within which you expect the population parameter to be. Its estimation is based on the data we have in our sample. It’s calculated in two different ways: one where you know the population variance and one where you don’t. We very rarely actually have population data, so we’ll normally be using the second calculation. A 95% confidence level means that if you repeated the sampling many times and built an interval from each sample, about 95% of those intervals would contain the true population parameter. 95% is 19 times out of 20.
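One way to read the 95% figure is as a long-run success rate of the procedure. The simulation below is a rough sketch of that idea: it draws many samples from an assumed normal population (the mean, standard deviation, sample size, and number of trials are all hypothetical), builds a known-variance interval from each, and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population and experiment settings (hypothetical values)
mu, sigma = 100.0, 15.0
n, trials = 30, 2000

z = NormalDist().inv_cdf(0.975)   # reliability factor for 95% confidence
margin = z * sigma / n ** 0.5     # margin of error when sigma is known

hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The printed share should land close to 95%, though it will vary a little from run to run if the seed is changed.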
The formula for all confidence intervals is: from the point estimate minus the reliability factor times the standard error, to the point estimate plus the reliability factor times the standard error.
Steps
- Identify a sample statistic
- Choose a confidence level
- Find the margin of error
- Calculate the interval
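The four steps above can be sketched in Python for the case where the population standard deviation is known. The function name `z_confidence_interval`, the sample values, and the value of `sigma` are assumptions made for illustration only.

```python
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Interval for the mean when the population std dev (sigma) is known."""
    # Step 1: identify the sample statistic (here, the sample mean)
    x_bar = mean(sample)
    # Step 2: choose a confidence level; alpha is what is left over
    alpha = 1 - confidence
    # Step 3: margin of error = reliability factor * standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma / len(sample) ** 0.5
    margin = z * standard_error
    # Step 4: calculate the interval
    return x_bar - margin, x_bar + margin

prices = [21.0, 24.5, 23.0, 25.5, 23.5]  # hypothetical sample
low, high = z_confidence_interval(prices, sigma=2.0)  # sigma is assumed known
print(f"95% interval: {low:.2f} to {high:.2f}")
```

When the population variance is not known, the same structure applies, but the reliability factor comes from the t distribution and the sample standard deviation replaces sigma.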
Example
Let’s look at an example that’s based on the Udemy.com course called Statistics for Data Science and Business Analysis. Let’s say we are analyzing annual salaries for data scientists in the United States in the year 2022. Imagine you have certain information that the population standard deviation of data science salaries is equal to $15,000. Normally you wouldn’t know this. Furthermore, you know the salaries are normally distributed and your sample consists of 30 salaries.
The reliability factor is z of alpha divided by 2. Alpha is one minus the confidence level. So, for a confidence level of 95%, alpha would be equal to 5%. For a 99% confidence level, alpha would be 1%.
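As a quick check on these values, the reliability factor can be looked up from the standard normal distribution. This sketch uses `NormalDist` from Python's standard library; the helper name `reliability_factor` is just for illustration.

```python
from statistics import NormalDist

def reliability_factor(confidence):
    """z of alpha/2 for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    # alpha is split evenly between the two tails of the distribution
    return NormalDist().inv_cdf(1 - alpha / 2)

z_95 = reliability_factor(0.95)
z_99 = reliability_factor(0.99)
print(round(z_95, 2), round(z_99, 2))  # 1.96 2.58
```

A higher confidence level gives a larger reliability factor, and therefore a wider interval.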
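When the population variance is known, the formula that gives us the interval is: x̄ ± z(α/2) × σ / √n, where σ / √n is the standard error of the mean.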
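Plugging the example's numbers into the known-variance formula (sample mean plus or minus z of alpha over 2, times sigma over the square root of n) gives the margin of error. This is a sketch only: the sample mean is not stated in this section, so `interval_around` takes it as an argument instead of assuming a value.

```python
from statistics import NormalDist

sigma = 15_000      # population standard deviation from the example
n = 30              # sample size from the example
confidence = 0.95

alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor
standard_error = sigma / n ** 0.5
margin = z * standard_error

def interval_around(x_bar):
    """Confidence interval for a given sample mean (hypothetical helper)."""
    return x_bar - margin, x_bar + margin

print(f"standard error = {standard_error:,.2f}")
print(f"margin of error = {margin:,.2f}")
```

With these inputs the margin of error comes out at roughly $5,370 on either side of whatever the sample mean turns out to be.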