Intro to Statistics and Data Science

A Statistical Background

A.1 Basic statistical terms

A.1.1 Mean

The mean, also known as (AKA) the average, is the most commonly reported measure of center. It is commonly called the "average" though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \(n\) data points, the mean is given by: \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Note that you will often see shorthand notation for a sum of numbers using \(\sum\) notation. For example, we could rewrite the formula for \(\bar{x}\) as:

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}=\frac{\sum_{i = 1}^n x_i}{n}\]

We've simply replaced the subscript on each \(x\) with a generic index \(i\), and use the \(\sum\) notation to indicate we are summing \(x\)s with indices (i.e. subscripts) that go from 1 to \(n\). When summing numbers in statistics, we're almost always dealing with indices that start with the value 1 and go up to a value equal to a sample size (e.g. \(n\)), so often you will see an even more shorthand version, where it's assumed you're summing from \(i = 1\) to \(i = n\):

\[\bar{x} = \frac{\sum x_i}{n}\]

A.1.2 Median

The median is calculated by first sorting a variable's data from smallest to largest. After sorting the data, the middle element in the list is the median. If the middle falls between two values, then the median is the mean of those two values.

A.1.3 Standard deviation

We will next discuss the standard deviation of a sample dataset pertaining to one variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far to expect a given data value is from its mean:

\[Standard \, deviation = \sqrt{\frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i= 1}^n(x_i - \overline{x})^2 }{n - 1}}\]

A.1.4 Five-number summary

The five-number summary consists of five values: minimum, first quantile AKA 25th percentile, second quantile AKA median AKA 50th percentile, third quantile AKA 75th, and maximum. The quantiles are calculated as

  • first quantile (\(Q_1\)): the median of the first half of the sorted data
  • third quantile (\(Q_3\)): the median of the second half of the sorted data

The interquartile range is defined as \(Q_3 - Q_1\) and is a measure of how spread out the middle 50% of values is. The five-number summary is not influenced by the presence of outliers in the ways that the mean and standard deviation are. It is, thus, recommended for skewed datasets.

A.1.5 Distribution

The distribution of a variable/dataset corresponds to generalizing patterns in the dataset. It often shows how frequently elements in the dataset appear. It shows how the data varies and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization.

A.1.6 Outliers

Outliers correspond to values in the dataset that fall far outside the range of "ordinary" values. In regards to a boxplot (by default), they correspond to values below \(Q_1 - (1.5 * IQR)\) or above \(Q_3 + (1.5 * IQR)\).

Note that these terms (aside from Distribution) only apply to quantitative variables.

A.2 Normal distribution discussion