Series of Posts
This post is the first in a short series called Statistics. You can see the list of posts in this series at the top of the post. The next series is called Inferential Statistics, followed by Confidence Intervals and then Hypothesis Testing.
What is statistics? Statistics is a mathematical science pertaining to the collection, tabulation, classification, analysis, and explanation of quantitative data. The analysis and explanation of this quantitative data may involve making predictions and forecasts and drawing conclusions. Statistics is a discipline under the Data Science umbrella.
This discussion is just a brief introduction to statistics. Its context is working with data in a business or similar setting. You might be working with financial data or social data, or with geographic, product, or health data. It also covers the design of tables and graphs to be presented to decision-makers.
See the Forest and the Trees
The study of statistics starts off fairly easily but builds over time. You need to know the basics very well before you can understand more advanced topics, such as inferential statistics and hypothesis testing. Sometimes I lose sight of the forest when I’m looking at the tree in front of me, and I find myself asking: when do I use this formula? Under what conditions will this formula work? Persistence matters here. Also, statistics is only really understood when it is practiced.
Key Concepts
The fundamentals of statistics are critically important. They include, but are not limited to, basic math, populations and samples, probability, types of data (categorical and numerical), discrete and continuous numerical data, basic charts and graphs, histograms, measures of central tendency (mean, median, mode), measures of asymmetry (skewness), and measures of variability (variance, standard deviation, and coefficient of variation).
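To make a few of these measures concrete, here is a minimal sketch using Python's built-in statistics module. The sales figures are made up for illustration, and the skewness shown is Pearson's second skewness coefficient, which is only one of several ways to measure asymmetry.

```python
import statistics

# Hypothetical data: units sold last month by eight stores.
sales = [12, 15, 15, 18, 20, 22, 25, 41]

# Measures of central tendency
mean = statistics.mean(sales)
median = statistics.median(sales)
mode = statistics.mode(sales)

# Measures of variability (sample versions, which divide by n - 1)
variance = statistics.variance(sales)
std_dev = statistics.stdev(sales)
coeff_var = std_dev / mean  # coefficient of variation: spread relative to the mean

# Measure of asymmetry: Pearson's second skewness coefficient.
# A positive value suggests a longer tail on the right.
skewness = 3 * (mean - median) / std_dev

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std dev={std_dev:.2f}, CV={coeff_var:.2f}")
print(f"skewness={skewness:.2f}")
```

The one large value (41) pulls the mean above the median, which is why the skewness comes out positive.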
Probability
If you’ve never studied probability, I recommend learning the basics before going through this series of posts on statistics.
Descriptive vs. Inferential Statistics
Descriptive statistics involves the collection, organization, summarization, and presentation of data. It works with data in the past. A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information. Descriptive statistics summarize the main features of a dataset, so they can be used to quickly understand a large amount of data. For example, you could use descriptive statistics to find the mean (average) height of a group of people.
Inferential statistics involves generalizing from samples to populations using probabilities, whereas descriptive statistics focuses on describing the contents of the sample. There are two forms of descriptive statistics:
- Visuals (graphs and tables)
- Summary Statistics – two main types: central tendency and dispersion
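As a rough sketch of the two forms listed above, the snippet below builds a small frequency table (a plain-text stand-in for a visual) and then computes one measure of central tendency and one measure of dispersion. The survey responses and order values are hypothetical.

```python
from collections import Counter
import statistics

# Hypothetical data: yes/no survey answers and order values in dollars.
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes"]
order_values = [20, 35, 35, 40, 55, 60, 75]

# Visual form: a frequency table with a simple text bar for each category.
frequency_table = Counter(responses)
for answer, count in frequency_table.most_common():
    print(f"{answer:<5}{count:>3}  {'#' * count}")

# Summary form: central tendency (median) and dispersion (range).
center = statistics.median(order_values)
spread = max(order_values) - min(order_values)
print(f"median = {center}, range = {spread}")
```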
Inferential statistics is a more advanced topic that includes performing hypothesis testing, determining relationships between variables, and making predictions. Where descriptive statistics looks back at data already collected, inferential statistics looks beyond it, often toward the future. An inference is a conclusion reached on the basis of evidence and reasoning. Inferential statistics aims at drawing conclusions and making predictions about the population from a sample.
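The idea of generalizing from a sample to a population can be sketched with a small simulation. Everything here is an assumption for illustration: the population is simulated heights, the sample size is 100, and the interval uses the normal approximation with z ≈ 1.96. Later posts in the series cover confidence intervals properly.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights in cm of 10,000 people.
population = [random.gauss(170, 10) for _ in range(10_000)]

# In practice we rarely see the whole population; we observe a sample.
sample = random.sample(population, 100)

# Use the sample to estimate the population mean.
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Rough 95% interval for the population mean (normal approximation).
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

population_mean = statistics.mean(population)
print(f"sample mean: {sample_mean:.1f} (interval {low:.1f} to {high:.1f})")
print(f"population mean: {population_mean:.1f}")
```

Because we simulated the population, we can check the estimate against the true mean. With real data that check is usually not possible, which is why the interval matters.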
Descriptive Statistics
We can classify data in two ways: Type and Measurement Level. We have two types of data: categorical and numerical (or quantitative). Categorical data describes groups or classifications. This includes answers to yes/no questions. Numerical data is subdivided into discrete (countable values, such as the number of items sold) and continuous (measured values, such as height or weight).
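Here is a rough sketch of how the type classification might look in code. The classify function is hypothetical and relies on a simple heuristic based on Python types, so it will get some cases wrong. For example, ZIP codes stored as integers are really categorical.

```python
def classify(values):
    """Guess the data type of a column from the Python types of its values.

    Heuristic only: text and booleans are treated as categorical,
    whole numbers as discrete, and other numbers as continuous.
    """
    if all(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "numerical (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical (continuous)"
    return "mixed"


print(classify(["Yes", "No", "Yes"]))    # answers to a yes/no question
print(classify([3, 0, 2, 5]))            # number of items sold
print(classify([171.5, 168.0, 180.2]))   # heights in cm
```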
Learn with YouTube
Here’s a really good channel on statistics called StatQuest by Josh Starmer. He also covers machine learning and Data Science. The channel includes a section on Statistics Fundamentals.
Here is a video on statistics called Statistics made easy ! ! ! Learn about the t-test, the chi square test, the p value and more. It does move along quickly, however.
Here is a video called Teach me STATISTICS in half an hour! Seriously by zedstatistics. This one is not heavy on formulas; it focuses on understanding data types, distributions, sampling, hypothesis testing and p-values. A good introduction.
Here’s a YouTube video called Data Analysis: How Much STATISTICS Do You Need to Know?. It’s by Thu Vu data analytics and it’s just under 14 minutes long.
Here’s a video called Ace Statistics Interviews: A Data-driven Approach For Data Scientists by Emma Ding. The top topics are: p-value, assumptions of linear regression, t-test, correlation coefficient, types of errors, z-test, central limit theorem, skewed distribution, power analysis, power, Simpson’s paradox, R squared, confidence interval, and so on.
Here’s a video called Practical Statistics for Data Scientists – Chapter 1 – Exploratory Data Analysis. The video is by Shashank Kalanithi.
Learn with Books
Modern Statistics with R is an online book.
Practical Statistics for Data Scientists, Second Edition, was published by O’Reilly in May 2020.
You can get access to a free online book on statistics. It’s called Introduction to Modern Statistics. It’s by Mine Çetinkaya-Rundel and Johanna Hardin.