The describe() method can be called on a pandas DataFrame. By default, describe() returns descriptive statistics for the numeric columns only. For each numeric column, the output contains the count, mean, standard deviation, minimum, the 25th, 50th, and 75th percentiles, and the maximum.
Percentiles are values that divide a set of observations into 100 equal parts. The 75th percentile, also known as the third, or upper, quartile, is the value below which 75% of the numbers lie and above which 25% of the numbers lie. Percentiles start at the bottom and go up: if you are ranking students and you are in the 99th percentile of marks, you are in the top 1 percent.
describe() is useful because it gives a variety of key statistics all at once. Comparing the mean and the median (the 50% row) can hint at outliers: if the two are not similar, outliers may be influencing the mean. You can also subtract the min from the max to calculate the range.
import pandas as pd

data = {'one': [1, 2, 3, 4, 5],
        'two': [10, 12, 14, 16, 18],
        'chars': ['abc', 'def', 'hij', 'klm', 'nop']}
df = pd.DataFrame(data)
df
df.describe()
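As a small sketch of how the range and the mean-versus-median comparison can be read off the describe() output, the example below uses a made-up Series (the name scores and its values are assumptions for illustration, not part of the DataFrame above). One large value is included on purpose so that the mean and the median disagree.

```python
import pandas as pd

# Hypothetical data: the value 100 pulls the mean away from the median.
scores = pd.Series([1, 2, 3, 4, 100], name="scores")

# describe() on a numeric Series returns a Series indexed by statistic name.
stats = scores.describe()

# Range: maximum minus minimum.
value_range = stats["max"] - stats["min"]

# The "50%" row is the median; a large gap from the mean hints at outliers.
mean_median_gap = stats["mean"] - stats["50%"]

# The "75%" row is the 75th percentile (upper quartile).
upper_quartile = stats["75%"]

print(stats)
print("range:", value_range)
print("mean - median:", mean_median_gap)
print("75th percentile:", upper_quartile)
```

A gap between the mean and the median is only a hint, not proof of outliers; skewed data without any extreme values can produce the same pattern.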
You can also use describe(include='all') to include the non-numeric columns in the output.
describe() excludes missing values (NaN) from its calculations. Handling missing values is not a simple issue. We have a post on the topic called Missing Data in Pandas.
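To see the effect of skipping missing values, here is a minimal sketch with a made-up Series (the name values and its contents are assumptions for illustration). The count reported by describe() is smaller than the length of the Series, and the mean is computed from the non-missing entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
values = pd.Series([1.0, 2.0, np.nan, 4.0])

stats = values.describe()

# len() counts every row, including the NaN.
print("rows:", len(values))
# count ignores the NaN, so it is one less than the number of rows.
print("count:", stats["count"])
# The mean is (1 + 2 + 4) / 3, not divided by 4.
print("mean:", stats["mean"])
```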
For a categorical column, describe() gives you the following output:
- count: Number of non-NA/null observations
- unique: Number of unique values
- top: The most common value (the mode)
- freq: The frequency of the most common value
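The four fields listed above can be checked with a short sketch. The Series colors and its values are made up for illustration; one entry is missing on purpose to show that count skips it.

```python
import pandas as pd

# Hypothetical categorical data with one missing entry.
colors = pd.Series(["red", "blue", "red", None, "green", "red"], name="color")

# For a non-numeric column, describe() reports count, unique, top, and freq.
summary = colors.describe()

print(summary)
print("count:", summary["count"])    # non-missing observations
print("unique:", summary["unique"])  # distinct values
print("top:", summary["top"])        # most common value (the mode)
print("freq:", summary["freq"])      # how often the mode occurs
```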
How do you calculate the mean value of a column in a DataFrame? Suppose your DataFrame is called df and your column is called col_name. After importing NumPy as np, use this code: np.mean(df["col_name"]). In a similar way you can calculate the median, min, max, and std with np.median, np.min, np.max, and np.std. Each of these NumPy functions returns an individual statistic about numerical data.
By default, the numpy library uses 0 as the Delta Degrees of Freedom (ddof), while the pandas library uses 1. That means numpy computes the population standard deviation by default and pandas computes the sample standard deviation. To get the same value from either library, set the ddof parameter to 1 when calculating the standard deviation with numpy: np.std(df["col_name"], ddof=1)
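The difference between the two defaults can be seen in a minimal sketch. It rebuilds a one-column DataFrame with the same values as the 'two' column from earlier, and converts the column to a NumPy array with to_numpy() so that the numpy calls operate on a plain array (that conversion is a choice made here for clarity, not something the text requires).

```python
import numpy as np
import pandas as pd

# Same values as the 'two' column used earlier in this post.
df = pd.DataFrame({"two": [10, 12, 14, 16, 18]})
values = df["two"].to_numpy()

# numpy default: ddof=0 (divide by n), the population standard deviation.
population_std = np.std(values)

# numpy with ddof=1 (divide by n - 1), the sample standard deviation.
sample_std_np = np.std(values, ddof=1)

# pandas default: ddof=1, so this matches the line above.
sample_std_pd = df["two"].std()

print("numpy, ddof=0:", population_std)
print("numpy, ddof=1:", sample_std_np)
print("pandas default:", sample_std_pd)
```

The std row in describe() uses the pandas default, so it matches the ddof=1 result.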
# Set the include parameter to 'all' to specify that all columns
# of the input be included in the output.
df.describe(include='all')
Single Column
You can also use the describe function on a single column/feature of your DataFrame. In the same way, you can calculate the minimum, the maximum, and the standard deviation of a single column. Here is some code below, assuming your DataFrame is called df and the column is called my_col.
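import numpy as np

df["my_col"].describe()          # summary statistics for one column
np.min(df["my_col"])             # minimum
np.max(df["my_col"])             # maximum
np.std(df["my_col"], ddof=1)     # sample standard deviation, matching pandas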
Functions in the pandas and numpy libraries can be used to find statistics that describe a dataset. The describe() function from pandas generates a table of descriptive statistics about numerical or categorical columns. The mean(), median(), min(), max(), and std() functions from numpy are useful for finding individual statistics about numerical data.