Lecture 1: Crash course in statistics¶
We try to avoid frustrating you with too much probability and statistics. Concentrate to get the basics, so we can quickly get to the more fun stuff! What you need to absolutely know is in this notebook:
Mean (moyenne), range, variance, standard deviation (écart type)
Median, quartiles, percentiles (or quantiles)
Scatter plots, covariance and correlation
Random variable, Histograms, Probability Density Function (pdf)
Pair plots
Normal distribution and confidence intervals (error bars)
Variance of the mean, standard error, numerical precision, and sample size
import numpy as np # Library useful to do matrix algebra
import pandas as pd # Library useful to manipulate data tables (data frames) and visualize data
import seaborn as sns; sns.set() # Note useful data visualization functions
from scipy.stats import bernoulli# Errors made are like coin flips, the follow a Bernoulli distribution
import matplotlib.pyplot as plt # Library to make graphs
import matplotlib.pylab as pylab # Set font sizes as desired
import matplotlib.ticker as ticker
params = {'legend.fontsize': 'xlarge',
'figure.figsize': (20, 10),
'axes.labelsize': 'xxlarge',
'axes.titlesize':'xxlarge',
'xtick.labelsize':'xxlarge',
'ytick.labelsize':'xxlarge'}
pylab.rcParams.update(params)
# Command to insert the graphs in line in the notebook:
%matplotlib inline
from platform import python_version
print(python_version()) # Check which version of Python you are running
0. Load and visualize data¶
To get some concrete real example, we weighted 10 apples and 10 bananas (in grams). We also measured the length of the 10 bananas (in cm). Note that for a given index, there is no relation between the apple weight and the banana weight, but the banana weight and length are measured for the same banana.
We load the data as a Pandas dataframe.
df = pd.read_csv("AppleBanana.csv")
We use a nice feature of Pandas that allows us to visualize tables as a heat map. It appears immediately visually that AppleWeight and BananaWeight are not correlated (as expected), but BananaWeight and BananaLength seem to be correlated. Another observation is that there are two clusters of 5 apples, the first one is be composed of lighter apples than the second one.
df.style.background_gradient(cmap='Blues')
If we use the function describe()
of Pandas, we get a bunch of statistics, le us see whether we understand what those are.
df.describe()
1. Mean, variance, and standard deviation¶
We are going the use the banana weights to illustrate the notions of Mean (moyenne), variance, and standard deviation (écart type).
First comes first: we VISUALIZE the weights of the bananas on a line.
def show_series(df, series='BananaWeight'):
SortedBananaWeight = df[series].sort_values()
fig = plt.figure(figsize=(20, 1))
plt.plot(SortedBananaWeight, np.ones(len(SortedBananaWeight)), 'go')
plt.xlabel('%s' % series)
ax = plt.gca()
ax.axes.yaxis.set_visible(False)
tick_spacing = 1
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
show_series(df)
Mean:¶
MEAN: Sum of values divided by number of values.
In mathematical notations, if the weights of the bananas are denoted: \(x_1 = 149, x_2 = 154, \cdots, x_n = 151\), the mean will be:
\(\mu = \frac{1}{10} (x_1 + x_2 + \cdots + x_n) ~~= ~~\frac{1}{n}\sum_{i=1}^n x_i \)
that is:
\(\mu = \frac{1}{10} (149 + 154 + \cdots + 151)\)
Let us check that and verify the we obtain the same value as provided by df.describe()
.
mu = (1/10) * df['BananaWeight'].sum()
assert(mu== df['BananaWeight'].mean())
mu
We can now place the mean as a red vertical bar in the plot we made previously.
show_series(df)
plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '', color='red')
plt.title("Mean = red vertical bar")
Range, variance and standard deviation¶
We now need a measure that summarizes the dispersion of points (voir indicateurs de dispersion). We could use the range (en français: étendue). The range is define as the difference between the maximum (largest) and the minimum (smallest) values:
rng\(= \max_i(x_i)  \min_i(x_i) = 166142 = 24\)
rng = df['BananaWeight'].max()df['BananaWeight'].min()
rng
Another possibility would be to compute the mean absolute deviation (MAD) (en français: écart moyen) defined as:
mad\( = \frac{1}{n} \sum_{i=1}^n x_i  \mu\)
that is the average of the distances between the \(b_i\) values and the mean \(\mu\). Notice that if you omitted the absolute value, you would get: \(\frac{1}{n} \sum_{i=1}^n (x_i  \mu) = \frac{1}{n} \sum_{i=1}^n x_i  \frac{1}{n} n \mu = \mu  \mu = 0 \), which is not very useful.
mad = df['BananaWeight'].mad()
mad
But more commonly, people use the standard deviation (std ou stdev) (en français: écart type), which is the square root of the variance (en français: variance). The variance is defined as:
VARIANCE: Mean Square Distance to the Mean.
VARIANCE (Français): La moyenne des carrés des écarts à la moyenne.
var\(~ = \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i  \mu)^2\).
So the only difference between mad\(_i\) and var\(_b\) is that we SQUARE the differences \((x_i  \mu)\) instead of taking their absolute values. Then to get the standard deviation, on just needs to take the square root:
STANDARD DEVIATION: Root Mean Square Distance to the Mean.
ECART TYPE (Français): La racine carrée de la moyenne des carrés des écarts à la moyenne.
std\(~ = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i  \mu)^2}\).
var = df['BananaWeight'].var()
var
std = df['BananaWeight'].std()
std
assert(std==np.sqrt(var))
Let us plot now as dashed red vertical bars \((\mu  \sigma)\) and \((\mu + \sigma)\):
def show_mean_std(df, series = 'BananaWeight', two_sigma=False):
dfb = df[series]
mu = dfb.mean()
std = dfb.std()
plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '', color='red')
plt.axvline(mustd, alpha = 1, ymax = 1, linestyle = '', color='red')
plt.axvline(mu+std, alpha = 1, ymax = 1, linestyle = '', color='red')
if two_sigma:
plt.axvline(mu2*std, alpha = 1, ymax = 1, linestyle = '', color='blue')
plt.axvline(mu+2*std, alpha = 1, ymax = 1, linestyle = '', color='blue')
show_series(df)
show_mean_std(df)
plt.title("(mean + std) and (mean  std) = red dashed vertical bars")
2. Median, quartiles, percentiles (or quantiles)¶
So far we have explained in df.describe()
mean and std. We also found that min and max are the smallest and largest values (respectively) and that they are used to compute the range rng=maxmin.
Next we need to explain the lines marqued as 25%, 50%, and 75%. These are called percentiles.
\(P^{th}\) PERCENTILE: Threshold below which P% of the values are found.
Percentiles are expressed in percent (beween 0 and 100) while quantiles are expressed in probability values (between 0 and 1); they are otherwise the same notion:
\(P^{th}\) QUANTILE: Threshold below which a fraction P of the values are found.
There are some special cases pf percentiles that have particular names:
LOWER QUARTILE: \(25^{th}\) percentile.
MEDIAN = MIDDLE QUARTILE: \(50^{th}\) percentile.
UPPER QUARTILE: \(75^{th}\) percentile.
The median splits the examples in two halves with the same number of examples on either side. The quartiles split the examples in 4 equal parts.
df.describe()
Let us plot now:
as a solid green vertical bars the median,
as dashed green vertical bars the upper and lower quartiles,
as dashed black vertical bars the \(5^{th}\) and \(95^{th}\) percentiles.
def show_percentiles(df, series = 'BananaWeight', tail=0.10):
dfb = df[series]
# Calculate percentiles
quant_L, quant_25, quant_50, quant_75, quant_U = dfb.quantile(tail), dfb.quantile(0.25), dfb.quantile(0.5), dfb.quantile(0.75), dfb.quantile(1tail)
print('QL=%5.2f, Q25=%5.2f, Q50=%5.2f, Q75=%5.2f, QU=%5.2f,' % (quant_L, quant_25, quant_50, quant_75, quant_U))
# Plot the green lines
plt.axvline(quant_25, alpha = 1, ymax = 1, linestyle = '', color='green')
plt.axvline(quant_50, alpha = 1, ymax = 1, linestyle = '', color='green')
plt.axvline(quant_75, alpha = 1, ymax = 1, linestyle = '', color='green')
# Plot the black lines
plt.axvline(quant_L, alpha = 1, ymax = 1, linestyle = '', color='black')
plt.axvline(quant_U, alpha = 1, ymax = 1, linestyle = '', color='black')
show_series(df)
show_percentiles(df)
plt.title("Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")
First we can check that the potition of the 25, 50, and 75 percentiles in describe()
coincide with those we found using quantile()
. Second, we can be surprised of where the percentile threshold lines end up. If you read the Pandas documentation on quantiles, this is due to the fact that, by default, some linear interpolation is performed. You can play with the various options to see how the position of the vertical lines is affected.
Let us now show mean and std and percentiles on top of each other.
show_series(df)
show_mean_std(df)
show_percentiles(df)
plt.title("Mean, std and percentiles overlayed.")
3. Scatter plots, covariance, and correlation¶
In a data frame, a column is called a “series”. It can be interesting to plot one series vs. another. Let us plot BananaLength vs. BananaWeight, and compare it with AppleWeight vs. BananaWeight.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))
df.plot(x='BananaWeight', y='BananaLength', kind="scatter", ax=ax1, color='g', s = 50)
df.plot(x='BananaWeight', y='AppleWeight', kind="scatter", ax=ax2, color='r', s = 50)
In the left plot, the points do no look super well aligned, still there seems a trend: the longer the banana, the heavier. In the right plot, as can be expected, there is no trend. We can actually QUANTIFY that by measuring how the series covary (vary together):
Covariance¶
Remember, we defined variance as the Mean Square Distance to the Mean, i.e. var\( = \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i  \mu_x)^2\).
By analogy with the variance, we define the covariance as:
COVARIANCE: Mean Joint Product of Distance to the Mean.
So given two series \(\{x_i\}\) and \(\{y_i\}\), the covariance is defined as:
\(\sigma_{xy} = \frac{1}{n} \sum_{i=1}^n (x_i  \mu_x)(y_i  \mu_y)\).
We show below the covariance matrix, i.e. the matrix of all covariances of pairs of series in a data frame. On the diagonal, we have \(\sigma^2\), the variance. Off diagonal, we have the covariances \(\sigma_{xy}\). Check that the variance of BananaWeight is the same as found above.
df.cov()
WARNING: The magnitude of the covariance coefficients are not easy to interpret, because they are not normalized, and hence they depend on the magnitudes of the variables (which can be expressed in different units, e.g. the length could be expressed in mm instead of cm and the weight in kg instead of g). Here we see for example that covariance(AppleWeight, BananaWeight) > covariance(BananaLength, BananaWeight), which is unexpected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related!
Correlation¶
The Pearson correlation coefficient is a normalized version of the covariance (so it solves the above problem of scale).
CORRELATION: Covariance normalized by product of standard deviations.
Mathematically:
\(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\)
The magnitude of \(\rho_{xy}\) represents the strength of the linear relation between the two series. Its values are between 1 and 1. Zero means no correlation, 1 means perfect correlation (points aligned, positive slope) and 1 means perfect anticorrelation (points aligned, negative slope).
The correlation matrix is the matrix of all correlations of pairs of series in a data frame. On the diagonal, we have \(\rho_{xx}\) values, which are always 1 (as expected from the definition of the correlation coefficient).
df.corr()
We are now reassured: we see that correlation(AppleWeight, BananaWeight) < correlation(BananaLength, BananaWeight), which is what we expected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related.
To visualize the correlation matrix, we can tint the boxes and turn it into a heat map:
df.corr().style.background_gradient(cmap='Blues')
Unfortunately, as can be seen, the coloring of boxes for the same numerical value is different in the different columns. This is due to the fact that Pandas spans the color map between the smallest and the largest value in each column. To overcome this problem, we can visualize the correlation matrix with Seaborn:
sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True)
4. Random variables, histograms, and PDF¶
Consider a quantity like BananaWeight. We can consider the production of bananas as a random process. Given a very large number of banana, if we pick one at random, it will have a given weight. Each banana has a slightly different weight. We call BananaWeight a random variable (variable aléatoire). Random variables are often denoted with capital letters, such as \(X\), and particular measurements or observations (called “realizations”) are denoted with small letters, such as \(x\).
If we draw a very large number of bananas, the plots we made in section 1 (dots aligned on one axis), will become impossible to read (the dots will fall on one another). We will need another means of visualizing data. We are going to introduce histograms.
HISTOGRAM: To make a histogram, we split the xaxis in class intervals (called bins or buckets). We then make diagram consisting of rectangles whose area is proportional to the frequency of occurences in the interval and whose width is equal to the class interval.
This is a way of representing as a bar graph Proba(\(X \in bin\)). If the number of examples grows to infinity and, simultaneously, the width of the bins goes to zero, a histogram becomes a density \(\rho(\)X = x\()\).
Grades info 111¶
As a first real life example, let us make histgrams of the grades of info 111. We replaced missing values by 0.
gdf = pd.read_csv("GradesInfo111.csv")
gdf.describe()
gdf.style.background_gradient(cmap='Blues')
col = gdf.columns[2]
sns.heatmap(gdf.sort_values(by=col).transpose())
Let us check one of the series, that of the third column:
show_series(gdf, series=col)
show_mean_std(gdf, series=col)
show_percentiles(gdf, series=col)
As can be seen, many points fall on top of one another. Hence it is difficult to visualize the distribution of grades in this way. Let us make a histogram.
gdf[col].hist()
plt.xlabel(col)
plt.ylabel('Counts')
show_mean_std(gdf, series = col)
show_percentiles(gdf, series = col)
plt.title("Mean = solid red. Mean + std = dashed red. Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")
Now it is easier to see that there is a higher density of good grades, except for a cluster of people at very low grade (probably people who did not show up at the exam).
Multiple histograms¶
Pandas allows you to easily draw simultaneously the histograms of all series.
gdf.hist(figsize=(20, 20))
5. Pair plots¶
When we have several series, it is convenient to show simultaneously histograms and scatter plots. Be patient, this is SLOW. On the diagonal you will see the histograms for the various series. Off diagonal you will see the scatter plots of all pairs of series.
sns.pairplot(gdf)
As you can see, several series are very correlated (points nearly aligned). We can quantify that with the correlation matrix.
sns.heatmap(gdf.corr(), fmt='0.2f', annot=True, square=True)
6. Normal Law and confidence intervals¶
Normal “Law” (“Loi” Normale)¶
Let us go back to our banana example. We were too lazy to weigh a large number of bananas. So we decided to simulate drawing 1000 examples of BananaWeight using the Normal law, but keeping the mean and standard deviation of our (small) real sample. Formally, we say the BananaWeight is a random variable distributed according to the Normal distribution or Gaussian distribution.
def draw_normal_bananas(df, series = 'BananaWeight', num=1000):
# We draw num examples using the Normal law, using the mean and std of df[series]
# By fixing the ramdom seed, we get the same sample every time we run the code
np.random.seed(seed=1234) # scipy.stats uses numpy.random to generate its random numbers
mu, sigma = df[series].mean(), df[series].std() # mean and standard deviation
s = np.random.normal(mu, sigma, num) # draw 1000 bananas
count, bins, ignored = plt.hist(s, 30, density=True) # make the histogram
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plot the Normal density
np.exp(  (bins  mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.xlabel(series)
plt.ylabel('Density')
return pd.DataFrame(s, columns = [series])
draw_normal_bananas(df)
The red line represents the density of bananas having a given weight. The blue bars represent the histogram obtained for 1000 bananas drawn randomly according the a Normal distribution with mean and standard deviation that are given by our emplirical sample of 10 bananas, used before.
The height of the bars is proportional to the number of bananas that fall in the interval of weight defined by the width of the bar.
Mean, median, stdev, quartiles, and other quantiles, or percentiles¶
As before, we can plot as vertical lines the mean variance the mean, median, stdev, quartiles, and other quantiles, or percentiles. But we chose different quantiles for the black dashed lines. The first one uses P=0.025=2.5% and the last one P=0.975=97.5%. In this way, we get 95% of the probability mass between the two black dashed lines.
df_normal = draw_normal_bananas(df)
show_mean_std(df_normal, two_sigma=True)
show_percentiles(df_normal, tail=0.025)
For the Normal distribution:
The mean and the median coincide.
We have beween \(\mu  \sigma\) and \(\mu  \sigma\) (red dashed lines) 68% of the probability mass. The is more than between the quartiles (green dashed lines): 50% of the probability mass, by definition.
We have beween \(\mu  2\sigma\) and \(\mu  2\sigma\) (blue dashed lines) 95% of the probability mass (this is verified by showing as black dashed lines the 2.5 percentile).
The Normal distribution is completely characterized by only two parameters: \(\mu\) and \(\sigma\) (sufficient statistics).
Confidence intervals¶
If we want to give a an idea of a typical BananaWeight, we can just provide the mean (point estimation), or we can give an interval in which it is probable to find the weight of a typical banana.
For the Normal distribution, we have:
Proba\((\mu\sigma<BananaWeight<\mu+\sigma)\) = 0.68
Proba\((\mu2\sigma<BananaWeight<\mu+2\sigma)\) = 0.95
The latter is called a 95% confidence interval or 95% CI. People often report results as \(\mu \pm \sigma\). This implicitly means a 68% CI.
Not all distributions are symmetric like the Normal distribution. the bounds of a 95% CI are not necessarily placed on either side of the mean (or median). But in this class, for all practical purposes, you will be permitted to assume that \(\mu \pm 2\sigma\) is a 95% CI.
ERROR BAR: There are multiple definitions of error bars.
When we write  We mean 

Measurement = $\mu \pm \sigma$  A 68% CI, for data Normally distributed. Proba$(\mu\sigma 
Measurement = $\mu \pm 2 \sigma$  A 95% CI, for data Normally distributed. Proba$(\mu2\sigma 
Measurement = $[L, U]$, 95% CI  $L$ is the lower bound and $U$ the upper bound of the 95% CI 
To get values for the three series in our fruit example, while respecting the covariance between the series, we can draw examples using the Normal multivariate distribution, using the covariance matrix of our (small) real data sample.
s = np.random.multivariate_normal(df.mean(), df.cov(), 1000)
df_multi = pd.DataFrame(s, columns = ['AppleWeight', 'BananaWeight', 'BananaLength'])
sns.pairplot(df_multi)
We can then check that the correlation matrix of the articial data resembles that of the real data.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
sns.heatmap(df_multi.corr(), fmt='0.2f', annot=True, square=True, ax=ax1)
ax1.set_title('Artificial data')
sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True, ax=ax2)
ax2.set_title('Real data')
6. Variance of the mean, standard error, error bars, numerical precision¶
Often, what we are interested in is not the distribution of a quantity like BananaWeight, but we want to estimate as precisely as possible the true MEAN of the quantity (its mathematical expectation). The larger the sample size n, the more precisely we will estimate the mean. In fact, the variance of the mean is inversely proportional to n.
Variance of the mean and standard error¶
We call \(\mu\) the true mean and \(\hat{\mu} = (1/n) (\sum_{i=1}^n X_i)\) the estimator of the empirical mean (computed with only n examples).
WARNING: We do not always make a distinction between:
the “true” mean of a random variable X \(\mu\), also called mathematical expectation, (which is obtained by averaging an infinite number of examples, drawn according to the distribution of X), and
the empirical mean \(\hat{\mu}\) (which is obtained by averaging an FINITE number of examples).
unless it is necessary because we manipulate both at the same time.
Note that \(\hat{\mu}\) is a random variable and as such it has a variance. We obtain different values of \(\hat{\mu}\) every time we draw randomly (and independently) n values of X over which we average.
VARIANCE OF THE MEAN: var\((\hat{\mu}) = \sigma_{\mu}^2 = \sigma^2/n\)
PROOF: You can skips this proof, but if you are interested, read on, it is not that hard.
Consider n random variables (RV) distributed identically, but independent of one another: \(X_1, X_2 \cdots X_n\), of mean \(\mu\) and of variance \(\sigma\). The variance of the sum will be the sum of the variances (search for “Sum of uncorrelated variables” in the Wikipedia page on variance):
\(var(\sum_{i=1}^n X_i) = \sum_{i=1}^n var(X_i) = n \sigma^2\).
The empirical mean is
\(\hat{\mu} = (1/n) (\sum_{i=1}^n X_i)\).
We put a hat on top of the empirical mean to distiguish it from the true mean \(\mu\). We also know that when a RV is scaled by a constant \(a\), its variance is multiplied by \(a^2\) (search for “Basic properties” in the Wikipedia page on variance). So:
var\((\hat{\mu})=(1/n)^2 var(\sum_{i=1}^n X_i) = (1/n)^2 n \sigma^2 = \sigma^2/n. \)
Therefore the variance of the mean is: var\((\hat{\mu}) = \sigma_{\mu}^2 = \sigma^2/n\)
The standard deviation of the mean (also known as standard error) is obtained by taking the square root of the variance of the mean:
STANDARD ERROR: \(\sigma_{\mu} = \sigma/\sqrt{n}\)
The law of large numbers, stipulates that as \(n\) goes to infinity the empirical mean \(\hat{\mu}\) converges to the “mathematical expectation” \(\mu\), and the variance of the mean goes to zero. Taking as example the Normal law and the 95% confidence interval we saw above, we easily see that:
Proba\((\mu  \sigma_{\mu} < \hat{\mu} < \mu  \sigma_{\mu}) = 0.95\)
Proba\((\mu  \sigma/\sqrt{n} < \hat{\mu} < \mu  \sigma/\sqrt{n}) = 0.95\)
Hence \(\hat{\mu}\) converges to \(\mu\) .
VERIFICATION: We now proceed to a simple empirical verification. We draw \(b\) samples of \(n\) values according to the Normal distribution (each one is called a “batch” or “block”). We compute the means \(\mu_i\) of each of these \(b\) batches, \(b = 1 \cdots n\). Then we compute the empirical variance \(\sigma_{\mu}\) of the means \(\mu_i\), over the \(b\) batches, that is \(\hat{\sigma}_{\mu}^2 = (1/b)\sum_{i=1}^b (\mu_i  \hat{\mu})^2\), where \(\hat{\mu} = (1/b)\sum_{i=1}^b \mu_i\). We added a hat on \(\sigma_{\mu}^2\) and \(\mu\) to remind ourselves that this is only an approximation evaluated on \(b\) batches, but if we make \(b\) large it should be good. If the theory is correct, then \(\hat{\sigma}_{\mu}^2\) should be inversely proportional to \(n\). This is easy to visualize on a loglog plot since \(\log \sigma_{\mu} = \log \sigma  \log n\). Hence, the points should be aligned and the line should have a slope 1 in loglog coordinates.
# For concreteness, we choose as mean and variance that of our empirical sample of banana weights
mu, sigma = df['BananaWeight'].mean(), df['BananaWeight'].std() # mean and standard deviation
# We make a note of the actual sample size
N = len(df['BananaWeight'])
# We make a note of the variance
V = sigma**2
# We select some sample sizes, growing exponentially, to verify the law of large numbers
n = [1, 10,100,1000,10000]
# We select a large batch size
b = 10000
v = []
for num in n:
mu_vec = np.zeros((b,1))
for i in range(b):
# We draw a vector of size num containing fake banana weights
x = np.random.normal(mu, sigma, num)
# We evaluate the mean for this number of bananas (num)
mu_vec[i,0] = x.mean() # We repeat this b times and fill a vector of b mu values
# We compute for each num value the variance of the mu, over the b batches
v = np.append(v, mu_vec.var())
def log_log_plt(n, v, N, V):
# Plot a list of variance values as a function of a list of n values
# Indicate in blue the actual value of N and the variance of mean predicted by the theory
# Assume n[0]=1
plt.plot(n, v)
plt.plot(n, v, 'ro')
plt.plot(N, V/N, 'bo')
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
plt.xlabel('Number examples $n$')
plt.ylabel('Variance of mean $\sigma_{\mu}^2$')
plt.text(n[0], v[0], '$\sigma^2_{\mu}=\sigma^2$', fontsize=18, color='r') # coordinate at origin
plt.text(N, V/N, '$\sigma^2_{\mu}=\sigma^2/N$', fontsize=18, color='b') # coordinate of original N value
plt.title('Variation of $\sigma^2_{\mu}$ as a function of $n$, for $\sigma^2=$%.0f and N=%d' % (V, N))
log_log_plt(n, v, N, V)
We verify that the points are well aligned, with slope (1) in the loglog scale, in accordance with:
LAW OF LARGE NUMBERS: \(\log \sigma_{\mu}^2= \log \sigma^2  \log n\).
The intercept (coordonnée à l’origine) is \(\log \sigma^2\). Hence is suffices for us to know (an empirical estimate of) \(\sigma^2\) to draw this entire curve, once we know the “theory”. In what follows we call \(\sigma\) the standard deviation estimated with our empirical sample and we assume that \(\sigma_{\mu}\) is computed with \(\sigma_{\mu} = \sigma/\sqrt{n}\), rather than being empirically estimated with \(b\) samples of the same size \(n\) (which would not make sense, because if we had \(bn\) samples, we would use all of them to get a more precise estimation of the mean.
The base of the log does not matter. Here for convenience we use log\(_{10}\).
Numerical precision and sample size¶
The question now is: How many examples do you need to get a given numerical precision?
There are several ways of defining numerical precision. In this lecture and for the next few weeks, we will define numerical precision as:
NUMERICAL PRECISION: \(\nu = \log_{10}\) snr.
SIGNAL TO NOISE RATIO: snr\( = \frac{\hat{\mu}}{\sigma_{\mu}}\).
WARNING: By doing that, we consider only the statistical error also called random error and ignore possible systematic errors (we will see that in the chapter on bias) and we do not use the standard comupter science definition of numerical precision. Our notion of precision is that used in signal processing and statistics. It is related to the standard deviation \(\sigma\), the signal to noise ratio \(snr= \mu / \sigma\) or its inverse the coefficient of variation \(cv = \sigma / \mu\) (or its square, the efficiency \(eff = (\sigma / \mu)^2\)). Hence: \(\nu = \log_{10} snr =  \log_{10} cv =  (1/2) \log_{10} eff\).
Now, if we use ONEsigma bars (68% condidence intervals for the Normal distribution) we will report results as:
Measurement = \(\hat{\mu} \pm \sigma_{\mu}\)
For example:
Measurement = \(1.00 \pm 0.01\)
In this case we will have \(\hat{\mu}/\sigma_{\mu}=100\), thus \(\nu=2\) and we see that \(\nu=2\) corresponds to the number of decimal places. If we have:
Measurement = \(123.000 \pm 0.123\)
then, \(\hat{\mu}/\sigma_{\mu}=1000\) , thus \(\nu=3\), and we will call that our numerical precision. We will keep only ONE trailing non zero number in the error bar and write the result as:
Measurement = \(123.0 \pm 0.1\)
SIGNIFICANT DIGITS: Significant digits of a calculated result \(\hat{\mu}\) are determined by the largest non zero digit of the uncertainty \(\sigma_{\mu}\), called “error digit”. We only keep a single error digit in the uncertainty when reporting results. The number of significant digits should be identical to the precision \(\nu\) (up to rounding errors) and will be the number of non zero digits above the “error digit”.
Yet another example:
Measurement = \(0.00318123 \pm 0.0005\)
then, \(\hat{\mu}/\sigma_{\mu}=6.36\) , thus \(\nu=0.8\), we will write the result as:
Measurement = \(0.0032 \pm 0.0005\)
We sse that we have one non zero digit in \(\hat{\mu}\) above the error digit, hence one significant digit, consistent with the fact that round(\(\nu\))=1. Note that we rounded the last digit of \(\hat{\mu}\).
FIXING THE NUMERICAL PRECISION: If we now want to IMPOSE a given numerical precision \(\nu = log_{10}\) snr, we can express \(\sigma_{\mu}\) in two different ways:
\(\sigma_{\mu} = \sigma~/~\sqrt{n}\)
\(\sigma_{\mu} = \hat{\mu}~/\) snr \(= \hat{\mu}~/~10^{\nu}\)
Solving for n, we get:
SAMPLE SIZE TO GET A GIVEN snr: \(~~n = \frac{\sigma^2}{\hat{\mu}^2}\) snr\(^2\) .
Or equivalently:
SAMPLE SIZE TO GET A NUMERICAL PRECISION snr: \(~~n = \frac{\sigma^2}{\hat{\mu}^2} 10^{2\nu}\) .
So given empirical estimates of the mean and the variance of the quantity of interest, we can compute the number of examples \(n\) we need to achieve a given precision \(\nu\). Unfortunately, \(n\) increases exponentially with \(\nu\). Note that technically \(n\) should be rounded to get an integer value.
BERNOULLI TRIALS: This formula looks daunting, but we can simplifying in one very useful particular case: If the quantity of interest \(X\) is just coin flips (e.g. 1 for an error and 0 for a correct answer), with a probability \(p\) of heads and \((1p)\) of tail. In that case, \(X\) is known to be distributed according to a Bernoulli distribution for which \(\mu=p\) and \(\sigma=p(1p)\). Using \(\nu=1\) (one significant digit) and \(\hat{\mu}\) as our estimator of \(p\), if \(p \ll 1\) we get the simplified formula:
\(n p \simeq 100\)
which is easy to remember. It means that to get “good” error bars, we need to see at least 100 errors.
GRAPHICAL REPRESENTATION: Let is go back to our graph in loglog coordinates and plot both:
\(\sigma^2_{\mu} = \sigma^2 / n\)
\(\sigma^2_{\mu} = \hat{\mu}^2/10^{2\nu}\).
The intersection of the two will give us the desired numerical precision. In log coordinates, we just need to find the intersection of two lines:
\(\log_{10} \sigma_{\mu}^2= \log_{10} \sigma^2  \log_{10} n\)
\(\log_{10} \sigma_{\mu}^2= \log_{10} \hat{\mu}^2  2 \nu \).
As a numerical application, let us use our original BananaWeight example.
# (1) This part is similar as before, except that we only give the 1st and the last point of the line
# the first one corrsponding to (n=1, sigma^2_mu=sigma^2) and the last (n=n_max, sigma^2_mu=sigma^2/n_max)
# We do not compute sigma^2_mu empirically, we just apply the formula sigma^2/n
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Find original value of n
N = len(df['BananaWeight'])
# Compute variance of the mean for the original sample number N
var_n = V/N
# Compute the original precision
nu_orig = np.round(np.log10(mu/np.sqrt(var_n)))
# Provide the original precision
print('Using N = %d examples, mean(BananaWeight) = %d + %d (nu = %d)' % (N, np.round(mu), np.round(np.sqrt(var_n)), nu_orig))
# Compute variance of the mean for a vary large value of n (n_max)
n_max = 10000
var_n_max = V/n_max
# Make the loglog plot
log_log_plt([1, n_max], [V, var_n_max], N, V)
# (2) This part is new. We want to impose a larger number of significant digits.
# We choose a precision nu and compute the number of examples n_nu necessary to get that precision
# Choose precision
nu = 3 # Three significant digits
# Get var_n_nu = mu^2 / 10^2nu
var_new_n = (mu/10**nu)**2
# Get new n value at intersection
new_n = np.round((var/var_new_n))
# Plot horizontal line at var_n_new
plt.plot([1, new_n], [var_new_n, var_new_n])
print(var_new_n, V/new_n, np.sqrt(var_new_n))
print('Using N*= %d examples, mean(BananaWeight) = %5.1f + %.1f (nu = %d)' % (new_n, mu, np.sqrt(var_new_n), nu))
# Plot vertical bar and show tex for new N value N*
plt.plot([new_n, new_n], [var_new_n, var_n_max], "", color='black')
plt.plot(new_n, var_new_n, 'ko')
plt.text(new_n, var_new_n, '$\sigma^2_{\mu}=\sigma^2/N*$', fontsize=18, color='k') # coordinate of new N value
plt.text(N, var_new_n, '$\sigma^2_{\mu}=\hat{\mu}^2/10^{2\\nu}$', fontsize=18, color='k') # coordinate of new N value
plt.text(new_n/1.5, var_n_max/4, '$N*$', fontsize=18, color='k') # Desired value of N*
NUMBER OF EXAMPLES AS A FUNCTION OF \(\nu\): Taking to log\(_{10}\) of \(~~n = \frac{\sigma^2}{\hat{\mu}^2} 10^{2\nu}\) we get:
\(\log_{10}n = 2\nu + b \) ,
with \(b = \log_{10} \sigma^2  2 \log_{10} \hat{\mu}\). So, given empirical estimates of the mean and the variance of the quantity of interest \(\sigma\) and \(\hat{\mu}\) (for a given number of examples \(N\)), we can easily make an abacus to determine the number of examples that would be needed to obtain any desired precision \(\nu\) on the estimate of the mean.
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Compute b
b = np.log10(V)  2* np.log10(mu)
# Choose first and last point
min_nu = 0
max_nu = 3
min_logn = 2*min_nu +b
max_logn = 2*max_nu +b
min_n = 10**min_logn
max_n = 10**max_logn
# Plot n on a log scale
plt.plot([min_nu, max_nu], [min_n, max_n], "", color='black')
ax = plt.gca()
ax.set_yscale('log')
plt.xlabel('Precision $\\nu$')
plt.ylabel('Number examples $n$')
# Indicate the precision of the original sample
N = len(df['BananaWeight'])
var_n = V/N
nu_orig = np.log10(mu/np.sqrt(var_n))
plt.plot(nu_orig, N, 'bo')
Bottomline: the number of examples increases exponetially with the precision.
Scaling the number of examples with the confidence¶
As indicated before, the size confidence interval depends on the degree of confidence we want to put into it. According to Wikipedia: “In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively.” In mathematical form, this means:
Proba\(( X\mu \leq \sigma ) \simeq 0.6827\)$
Proba\(( X\mu \leq 2 \sigma ) \simeq 0.9545\)
Proba\(( X\mu \leq 3 \sigma ) \simeq 0.9973\)
Even though not all distributions of interest are “normal”, many are nearly normal and it is usual, instead of reporting confidence intervals as:
Proba\(( X\mu \leq \epsilon ) \simeq (1  \delta)\)
where \(\delta\) is a small value, one only considers for \(\epsilon\) multiple values of \(\sigma\): \(1~\sigma\), \(2~\sigma\), \(3~\sigma\), \(\dots k~\sigma\), and one does not even bother reporting the exact corresponding value of \(\delta\).
One question of interest though is how does the number of examples needed to ensure a given confidence interval grow with \(\delta\) (or more simply with the number \(k\) in \(\epsilon=k~\sigma\).
This is easily settled. Earlier on, we defined the signal to noise ratio as snr\( = \mu / \sigma_{\mu}\). We indicated that this implicitly corresponds to a \(1~\sigma\) error bar: Measurement = \(\hat{\mu} \pm \sigma_{\mu}\). We now consider:
Measurement = \(\hat{\mu} \pm \epsilon_{\mu} = \hat{\mu} \pm k~\sigma_{\mu}\).
We redefine (more properly) the signal to noise ratio as:
snr\( = \mu / \epsilon_{\mu}\)
where \(\epsilon_{\mu} = k~\sigma_{\mu}\).
Thus \(\sigma_{\mu}\) can be expressed as:
\(\sigma_{\mu} = \mu / (k \)snr).
Applying the same defintion as before of numerical precision \(\nu = \log_{10}\) snr, we also get:
\(\sigma_{\mu} = \mu / (k 10^\nu)\).
And now, we can again apply the formula \(\sigma_{\mu} = \sigma / \sqrt{n}\) and solve for \(n\) to get:
SAMPLE SIZE TO GET A GIVEN PRECISION \(\nu\) and CONFIDENCE \(k\): \(~~~n = \frac{\sigma^2}{\mu^2} .k^2. 10^{2\nu}\) .
Bottomline: \(n\) grows only with the square of \(k\), but it grows exponentially with \(\nu\).
Let us pose:
\(n_{11}=\sigma^2 / \mu^2\)
We have the following table of growth of \(n\) with the number of significant digits \(\nu=\) and the number of sigmas in the error bar \(k\):
\(\nu=0\) 
\(\nu=1~digit\) 
\(\nu=2~digits\) 
\(\nu=3~digits\) 


\(\epsilon= 1~\sigma_{\mu}\) 
\(~~~~~~n_{11}\) 
\(100~n_{11}\) 
\(10000~n_{11}\) 
\(10^6~n_{11}\) 
\(\epsilon= 2~\sigma_{\mu}\) 
\(~~~~4~n_{11}\) 
\(400~n_{11}\) 
\(40000~n_{11}\) 
\(4~10^6~n_{11}\) 
\(\epsilon= 3~\sigma_{\mu}\) 
\(~~~~9~n_{11}\) 
\(900~n_{11}\) 
\(90000~n_{11}\) 
\( 9~10^6~n_{11}\) 
We typically recommend to use:
\(n=100~\sigma^2 / \mu^2\)
which corresponds to only \(1~\sigma_{\mu}\) confidence interval and \(\nu=1\) significant digit, which is the least we can do to get descent error bars. In the particular case of the Bernouilli distribution with parameter \(p\), we have \(\mu = p\) and \(\sigma^2 = p(1p)\). Thus if \(p \ll 1\), we find again:
\(n \simeq 100 /p\)
Bootstrap resampling¶
In the previous we found a simple scaling law for the number of examples needed to get a good precision of the mean. However, we made the assumption that the sample mean was (nearly) distributed with the Normal law. This is actually not so bad if the sample are drawn from the Bernoulli distribution, because the sum of Bernouilli random variables is distributed according to the Binomial law, which can be approximated with the Normal law for (relatively) large values of \(n\). But consider again the distribution of final exam grades:
gdf[col].hist(density=True)
plt.xlabel(col)
plt.ylabel('Density')
It has a “wierd” distribution. How could we know whether the distribution of its mean values (for different students, but similarly distributed) would be “Normal”? it is probably not normal. The idea othe the bootstrap is to sample \(n\) times (with replacement) examples from the \(n\) empirical values we have to obtain a “new” bootstrap sample, whose distribution is similar to that of our original sample. We do that \(b\) to get \(b\) bootstrap samples. Each of them has \(n\) examples (with some repetitions). For each sample of size \(n\), we can calculate its mean and see how it is distributed. Let’s give it a try!
def mean_bootstrap_samples(df, b=100000):
# take a df series and sample n values with replacement b times
# return a vector of mean values over the b bootstraps
n = len(df)
DF = np.random.choice(df, size =(n, b))
MU = np.mean(DF, axis=0)
return MU
bs = mean_bootstrap_samples(gdf[col])
mu, sigma = bs.mean(), bs.std() # mean and standard deviation
count, bins, ignored = plt.hist(bs, 30, density=True) # make the histogram
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plt the normal density
np.exp(  (bins  mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.xlabel("Mean " + col)
plt.ylabel('Density')
Morality: the density of the mean really looks like the Normal density.