Lecture 1: Crash course in statistics

We try not to frustrate you with too much probability and statistics. Concentrate on getting the basics right, so we can quickly get to the more fun stuff! Everything you absolutely need to know is in this notebook:

  1. Mean (moyenne), range, variance, standard deviation (écart type)

  2. Median, quartiles, percentiles (or quantiles)

  3. Scatter plots, covariance and correlation

  4. Random variable, Histograms, Probability Density Function (pdf)

  5. Pair plots

  6. Normal distribution and confidence intervals (error bars)

  7. Variance of the mean, standard error, numerical precision, and sample size

import numpy as np               # Library useful to do matrix algebra
import pandas as pd              # Library useful to manipulate data tables (data frames) and visualize data
import seaborn as sns; sns.set() # Nice data visualization functions
from scipy.stats import bernoulli # Errors made are like coin flips; they follow a Bernoulli distribution
import matplotlib.pyplot as plt  # Library to make graphs
import matplotlib.pylab as pylab # Used to set font sizes as desired
import matplotlib.ticker as ticker
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (20, 10),
          'axes.labelsize': 'xx-large',
          'axes.titlesize':'xx-large',
          'xtick.labelsize':'xx-large',
          'ytick.labelsize':'xx-large'}
pylab.rcParams.update(params)
# Command to insert the graphs in line in the notebook:
%matplotlib inline       

from platform import python_version
print(python_version())          # Check which version of Python you are running

0. Load and visualize data

To get a concrete real-life example, we weighed 10 apples and 10 bananas (in grams). We also measured the length of the 10 bananas (in cm). Note that, for a given index, there is no relation between the apple weight and the banana weight, but the banana weight and length are measured on the same banana.

We load the data as a Pandas dataframe.

df = pd.read_csv("AppleBanana.csv")

We use a nice feature of Pandas that allows us to visualize tables as a heat map. It appears immediately that AppleWeight and BananaWeight are not correlated (as expected), but BananaWeight and BananaLength seem to be correlated. Another observation is that there are two clusters of 5 apples: the first one is composed of lighter apples than the second one.

df.style.background_gradient(cmap='Blues')

If we use the function describe() of Pandas, we get a bunch of statistics; let us see whether we understand what those are.

df.describe()

1. Mean, variance, and standard deviation

We are going to use the banana weights to illustrate the notions of mean (moyenne), variance, and standard deviation (écart type).

First things first: we VISUALIZE the weights of the bananas on a line.

def show_series(df, series='BananaWeight'):
    SortedBananaWeight = df[series].sort_values()
    fig = plt.figure(figsize=(20, 1))
    plt.plot(SortedBananaWeight, np.ones(len(SortedBananaWeight)), 'go')
    plt.xlabel('%s' % series)
    ax = plt.gca()
    ax.axes.yaxis.set_visible(False)
    tick_spacing = 1
    ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
show_series(df)

Mean:

In mathematical notations, if the weights of the bananas are denoted: \(x_1 = 149, x_2 = 154, \cdots, x_n = 151\), the mean will be:

\(\mu = \frac{1}{10} (x_1 + x_2 + \cdots + x_n) ~~= ~~\frac{1}{n}\sum_{i=1}^n x_i \)

that is:

\(\mu = \frac{1}{10} (149 + 154 + \cdots + 151)\)

Let us check that and verify that we obtain the same value as provided by df.describe().

mu = (1/10) * df['BananaWeight'].sum()
assert(mu== df['BananaWeight'].mean())
mu

We can now place the mean as a red vertical bar in the plot we made previously.

show_series(df)
plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red')
plt.title("Mean = red vertical bar")

Range, variance and standard deviation

We now need a measure that summarizes the dispersion of the points (en français: indicateurs de dispersion). We could use the range (en français: étendue). The range is defined as the difference between the maximum (largest) and the minimum (smallest) values:

rng\(= \max_i(x_i) - \min_i(x_i) = 166-142 = 24\)

rng = df['BananaWeight'].max()-df['BananaWeight'].min()
rng

Another possibility would be to compute the mean absolute deviation (MAD) (en français: écart moyen) defined as:

mad\( = \frac{1}{n} \sum_{i=1}^n |x_i - \mu|\)

that is, the average of the distances between the \(x_i\) values and the mean \(\mu\). Notice that if you omitted the absolute value, you would get: \(\frac{1}{n} \sum_{i=1}^n (x_i - \mu) = \frac{1}{n} \sum_{i=1}^n x_i - \frac{1}{n} n \mu = \mu - \mu = 0 \), which is not very useful.

# Series.mad() was removed in recent versions of pandas, so we compute the mean absolute deviation directly
mad = (df['BananaWeight'] - df['BananaWeight'].mean()).abs().mean()
mad
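To see why the absolute value matters, here is a quick check (a minimal sketch on the same series) that the signed deviations average to zero, up to floating-point rounding:

signed_mean_dev = (df['BananaWeight'] - df['BananaWeight'].mean()).mean()
signed_mean_dev   # essentially 0, illustrating why we need the absolute value (or the square)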

But more commonly, people use the standard deviation (std or stdev) (en français: écart type), which is the square root of the variance (en français: variance). The variance is defined as:

var\(~ = \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2\).

So the only difference between mad and var is that we SQUARE the differences \((x_i - \mu)\) instead of taking their absolute values. Then, to get the standard deviation, one just needs to take the square root:

std\(~ = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}\).

var = df['BananaWeight'].var()
var
std = df['BananaWeight'].std()
std
assert(std==np.sqrt(var))
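Note that pandas' var() and std() (and hence df.describe()) divide by \(n-1\) (the so-called “unbiased” estimator) rather than by \(n\) as in the formula above. A small sketch using numpy's ddof argument shows both conventions; for our 10 bananas the difference is modest:

x = df['BananaWeight'].values
var_n  = np.var(x, ddof=0)    # divides by n, matches the formula above
var_n1 = np.var(x, ddof=1)    # divides by n-1, matches df['BananaWeight'].var()
print(var_n, var_n1, df['BananaWeight'].var())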

Let us plot now as dashed red vertical bars \((\mu - \sigma)\) and \((\mu + \sigma)\):

def show_mean_std(df, series = 'BananaWeight', two_sigma=False):
    dfb = df[series]
    mu = dfb.mean()
    std = dfb.std()
    plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red')
    plt.axvline(mu-std, alpha = 1, ymax = 1, linestyle = '--', color='red')
    plt.axvline(mu+std, alpha = 1, ymax = 1, linestyle = '--', color='red')
    if two_sigma:
        plt.axvline(mu-2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue')
        plt.axvline(mu+2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue')
    
show_series(df)
show_mean_std(df)
plt.title("(mean + std) and (mean - std) = red dashed vertical bars")

2. Median, quartiles, percentiles (or quantiles)

So far we have explained the mean and std entries of df.describe(). We also found that min and max are the smallest and largest values (respectively) and that they are used to compute the range rng=max-min.

Next we need to explain the lines marked 25%, 50%, and 75%. These are called percentiles.

Percentiles are expressed in percent (between 0 and 100) while quantiles are expressed as probabilities (between 0 and 1); they are otherwise the same notion.

There are some special cases of percentiles that have particular names:

The median splits the examples into two halves, with the same number of examples on either side. The quartiles split the examples into 4 equal parts.
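As an illustration (a small sketch using numpy), the same values can be obtained either as percentiles (between 0 and 100) with np.percentile, or as quantiles (between 0 and 1) with the pandas quantile() method:

x = df['BananaWeight']
print(np.percentile(x, [25, 50, 75]))        # percentiles, arguments between 0 and 100
print(x.quantile([0.25, 0.5, 0.75]).values)  # quantiles, arguments between 0 and 1: same values
print(x.median())                            # the median is the 50th percentile (0.5 quantile)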

df.describe()

Let us plot now:

  • as a solid green vertical bars the median,

  • as dashed green vertical bars the upper and lower quartiles,

  • as dashed black vertical bars the \(10^{th}\) and \(90^{th}\) percentiles.

def show_percentiles(df, series = 'BananaWeight', tail=0.10):
    dfb = df[series]
    # Calculate percentiles
    quant_L, quant_25, quant_50, quant_75, quant_U =  dfb.quantile(tail), dfb.quantile(0.25), dfb.quantile(0.5), dfb.quantile(0.75),  dfb.quantile(1-tail)
    print('QL=%5.2f, Q25=%5.2f, Q50=%5.2f, Q75=%5.2f, QU=%5.2f,' % (quant_L, quant_25, quant_50, quant_75, quant_U))
    
    # Plot the green lines
    plt.axvline(quant_25, alpha = 1, ymax = 1, linestyle = '--', color='green')
    plt.axvline(quant_50, alpha = 1, ymax = 1, linestyle = '-', color='green')
    plt.axvline(quant_75, alpha = 1, ymax = 1, linestyle = '--', color='green')

    # Plot the black lines
    plt.axvline(quant_L, alpha = 1, ymax = 1, linestyle = '--', color='black')
    plt.axvline(quant_U, alpha = 1, ymax = 1, linestyle = '--', color='black')
    
show_series(df)
show_percentiles(df)
plt.title("Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")

First, we can check that the positions of the 25%, 50%, and 75% percentiles in describe() coincide with those we found using quantile(). Second, we may be surprised by where the percentile threshold lines end up. If you read the Pandas documentation on quantiles, this is because, by default, some linear interpolation is performed. You can play with the various options to see how the position of the vertical lines is affected.
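For example, here is a small sketch showing how the interpolation argument of quantile() changes the value returned when the requested quantile falls between two data points:

x = df['BananaWeight']
for method in ['linear', 'lower', 'higher', 'nearest', 'midpoint']:
    print(method, x.quantile(0.25, interpolation=method))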

Let us now show mean and std and percentiles on top of each other.

show_series(df)
show_mean_std(df)
show_percentiles(df)
plt.title("Mean, std and percentiles overlayed.")

3. Scatter plots, covariance, and correlation

In a data frame, a column is called a “series”. It can be interesting to plot one series vs. another. Let us plot BananaLength vs. BananaWeight, and compare it with AppleWeight vs. BananaWeight.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))
df.plot(x='BananaWeight', y='BananaLength', kind="scatter", ax=ax1, color='g', s = 50)
df.plot(x='BananaWeight', y='AppleWeight', kind="scatter", ax=ax2, color='r', s = 50)

In the left plot, the points do not look perfectly aligned, but there still seems to be a trend: the longer the banana, the heavier it is. In the right plot, as expected, there is no trend. We can actually QUANTIFY this by measuring how the series co-vary (vary together):

Covariance

Remember, we defined variance as the Mean Square Distance to the Mean, i.e. var\( = \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)^2\).

By analogy with the variance, given two series \(\{x_i\}\) and \(\{y_i\}\), the covariance is defined as:

\(\sigma_{xy} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)\).

We show below the covariance matrix, i.e. the matrix of all covariances of pairs of series in a data frame. On the diagonal, we have \(\sigma^2\), the variance. Off diagonal, we have the covariances \(\sigma_{xy}\). Check that the variance of BananaWeight is the same as found above.

df.cov()
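We can verify one entry of this matrix by hand (a quick sketch; note that, like var(), pandas' cov() divides by \(n-1\)):

x, y = df['BananaWeight'], df['BananaLength']
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)   # same convention as df.cov()
print(cov_xy, df.cov().loc['BananaWeight', 'BananaLength'])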

Correlation

The Pearson correlation coefficient is a normalized version of the covariance (so, unlike the covariance, it does not depend on the scale or units of the series).

Mathematically:

\(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\)

The magnitude of \(\rho_{xy}\) represents the strength of the linear relation between the two series. Its values are between -1 and 1. Zero means no correlation, 1 means perfect correlation (points aligned, positive slope) and -1 means perfect anti-correlation (points aligned, negative slope).

The correlation matrix is the matrix of all correlations of pairs of series in a data frame. On the diagonal, we have \(\rho_{xx}\) values, which are always 1 (as expected from the definition of the correlation coefficient).

df.corr()
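Again, we can check one entry by hand (a quick sketch), dividing the covariance by the product of the standard deviations:

x, y = df['BananaWeight'], df['BananaLength']
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
rho_xy = cov_xy / (x.std() * y.std())
print(rho_xy, df.corr().loc['BananaWeight', 'BananaLength'])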

We are now reassured: we see that correlation(AppleWeight, BananaWeight) < correlation(BananaLength, BananaWeight), which is what we expected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related.

To visualize the correlation matrix, we can tint the boxes and turn it into a heat map:

df.corr().style.background_gradient(cmap='Blues')

Unfortunately, as can be seen, the coloring of boxes for the same numerical value is different in the different columns. This is due to the fact that Pandas spans the color map between the smallest and the largest value in each column. To overcome this problem, we can visualize the correlation matrix with Seaborn:

sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True)

4. Random variables, histograms, and PDF

Consider a quantity like BananaWeight. We can consider the production of bananas as a random process. Given a very large number of bananas, if we pick one at random, it will have a given weight. Each banana has a slightly different weight. We call BananaWeight a random variable (variable aléatoire). Random variables are often denoted with capital letters, such as \(X\), and particular measurements or observations (called “realizations”) are denoted with small letters, such as \(x\).

If we draw a very large number of bananas, the plots we made in section 1 (dots aligned on one axis) will become impossible to read (the dots will fall on top of one another). We will need another means of visualizing data: histograms.

A histogram is a way of representing, as a bar graph, Proba\((X \in \mathrm{bin})\). If the number of examples grows to infinity and, simultaneously, the width of the bins goes to zero, the histogram becomes a density \(\rho(X = x)\).
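To make this concrete, here is a small sketch showing that, with density=True, the bar heights of a histogram are scaled so that its total area is 1, as required for a density:

counts, bin_edges = np.histogram(df['BananaWeight'], bins=5, density=True)
bin_widths = np.diff(bin_edges)
print((counts * bin_widths).sum())   # total area under the histogram: 1.0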

Grades info 111

As a first real-life example, let us make histograms of the grades of Info 111. Missing values were replaced by 0.

gdf = pd.read_csv("GradesInfo111.csv")
gdf.describe()
gdf.style.background_gradient(cmap='Blues')
col = gdf.columns[2]
sns.heatmap(gdf.sort_values(by=col).transpose())

Let us check one of the series, that of the third column:

show_series(gdf, series=col)
show_mean_std(gdf, series=col)
show_percentiles(gdf, series=col)

As can be seen, many points fall on top of one another. Hence it is difficult to visualize the distribution of grades in this way. Let us make a histogram.

gdf[col].hist()
plt.xlabel(col)
plt.ylabel('Counts')
show_mean_std(gdf, series = col)
show_percentiles(gdf, series = col)
plt.title("Mean = solid red. Mean +- std = dashed red. Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")

Now it is easier to see that there is a higher density of good grades, except for a cluster of people with very low grades (probably students who did not show up for the exam).

Multiple histograms

Pandas allows you to easily draw simultaneously the histograms of all series.

gdf.hist(figsize=(20, 20))

5. Pair plots

When we have several series, it is convenient to show simultaneously histograms and scatter plots. Be patient, this is SLOW. On the diagonal you will see the histograms for the various series. Off diagonal you will see the scatter plots of all pairs of series.

sns.pairplot(gdf)

As you can see, several series are very correlated (points nearly aligned). We can quantify that with the correlation matrix.

sns.heatmap(gdf.corr(), fmt='0.2f', annot=True, square=True)

6. Normal Law and confidence intervals

Normal “Law” (“Loi” Normale)

Let us go back to our banana example. We were too lazy to weigh a large number of bananas, so we decided to simulate drawing 1000 examples of BananaWeight using the Normal law, keeping the mean and standard deviation of our (small) real sample. Formally, we say that BananaWeight is a random variable distributed according to the Normal distribution or Gaussian distribution.

def draw_normal_bananas(df, series = 'BananaWeight', num=1000):
    # We draw num examples using the Normal law, using the mean and std of df[series]
    # By fixing the random seed, we get the same sample every time we run the code
    np.random.seed(seed=1234)         # scipy.stats uses numpy.random to generate its random numbers
    mu, sigma = df[series].mean(), df[series].std() # mean and standard deviation
    s = np.random.normal(mu, sigma, num)                            # draw num bananas (1000 by default)
    count, bins, ignored = plt.hist(s, 30, density=True)            # make the histogram
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *                 # plot the Normal density
                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
                linewidth=2, color='r')
    plt.xlabel(series)
    plt.ylabel('Density')
    return pd.DataFrame(s, columns = [series])
    
draw_normal_bananas(df)

The red line represents the density of bananas having a given weight. The blue bars represent the histogram obtained for 1000 bananas drawn randomly according to a Normal distribution whose mean and standard deviation are those of our empirical sample of 10 bananas used before.

The height of the bars is proportional to the number of bananas that fall in the interval of weight defined by the width of the bar.

Mean, median, stdev, quartiles, and other quantiles, or percentiles

As before, we can plot as vertical lines the mean, median, stdev, quartiles, and other quantiles (or percentiles). But we choose different quantiles for the black dashed lines: the first one uses P=0.025=2.5% and the last one P=0.975=97.5%. In this way, we get 95% of the probability mass between the two black dashed lines.

df_normal = draw_normal_bananas(df)
show_mean_std(df_normal, two_sigma=True)
show_percentiles(df_normal, tail=0.025)

For the Normal distribution:

  • The mean and the median coincide.

  • Between \(\mu - \sigma\) and \(\mu + \sigma\) (red dashed lines) we have 68% of the probability mass. This is more than between the quartiles (green dashed lines), which delimit 50% of the probability mass by definition.

  • Between \(\mu - 2\sigma\) and \(\mu + 2\sigma\) (blue dashed lines) we have 95% of the probability mass (this is verified by showing as black dashed lines the 2.5th and 97.5th percentiles; see also the numerical check after this list).

  • The Normal distribution is completely characterized by only two parameters: \(\mu\) and \(\sigma\) (sufficient statistics).
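These percentages can be checked numerically, for instance with scipy.stats.norm (a quick sketch; norm is not imported at the top of the notebook, so we import it here):

from scipy.stats import norm
print(norm.cdf(1) - norm.cdf(-1))   # probability mass within mu +- 1 sigma, about 0.68
print(norm.cdf(2) - norm.cdf(-2))   # probability mass within mu +- 2 sigma, about 0.95
print(norm.ppf(0.975))              # about 1.96: the exact multiplier for a 95% CI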

Confidence intervals

If we want to give an idea of a typical BananaWeight, we can just provide the mean (point estimate), or we can give an interval in which it is probable to find the weight of a typical banana.

For the Normal distribution, we have:

  • Proba\((\mu-\sigma<BananaWeight<\mu+\sigma)\) = 0.68

  • Proba\((\mu-2\sigma<BananaWeight<\mu+2\sigma)\) = 0.95

The latter is called a 95% confidence interval or 95% CI. People often report results as \(\mu \pm \sigma\). This implicitly means a 68% CI.

Not all distributions are symmetric like the Normal distribution: the bounds of a 95% CI are not necessarily placed symmetrically on either side of the mean (or median). But in this class, for all practical purposes, you will be permitted to assume that \(\mu \pm 2\sigma\) is a 95% CI.

To get values for the three series in our fruit example, while respecting the covariances between the series, we can draw examples using the multivariate Normal distribution, using the covariance matrix of our (small) real data sample.

s = np.random.multivariate_normal(df.mean(), df.cov(), 1000)
df_multi = pd.DataFrame(s, columns = ['AppleWeight', 'BananaWeight', 'BananaLength'])
sns.pairplot(df_multi)

We can then check that the correlation matrix of the artificial data resembles that of the real data.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
sns.heatmap(df_multi.corr(), fmt='0.2f', annot=True, square=True, ax=ax1)
ax1.set_title('Artificial data')
sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True, ax=ax2)
ax2.set_title('Real data')

7. Variance of the mean, standard error, error bars, numerical precision

Often, what we are interested in is not the distribution of a quantity like BananaWeight, but we want to estimate as precisely as possible the true MEAN of the quantity (its mathematical expectation). The larger the sample size n, the more precisely we will estimate the mean. In fact, the variance of the mean is inversely proportional to n.

Variance of the mean and standard error

We call \(\mu\) the true mean and \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i\) the empirical mean, our estimator of \(\mu\) (computed with only n examples). The variance of the mean is \(\sigma_{\mu}^2 = \sigma^2/n\).

The standard deviation of the mean (also known as standard error) is obtained by taking the square root of the variance of the mean:

\(\sigma_{\mu} = \sigma/\sqrt{n}\)

The law of large numbers stipulates that as \(n\) goes to infinity, the empirical mean \(\hat{\mu}\) converges to the “mathematical expectation” \(\mu\), and the variance of the mean goes to zero. Taking as example the Normal law and the 95% confidence interval we saw above, we easily see that:

Proba\((\mu - 2\sigma_{\mu} < \hat{\mu} < \mu + 2\sigma_{\mu}) = 0.95\)

Proba\((\mu - 2\sigma/\sqrt{n} < \hat{\mu} < \mu + 2\sigma/\sqrt{n}) = 0.95\)

As \(n\) grows, this interval shrinks around \(\mu\); hence \(\hat{\mu}\) converges to \(\mu\).

# For concreteness, we choose as mean and variance that of our empirical sample of banana weights
mu, sigma = df['BananaWeight'].mean(), df['BananaWeight'].std() # mean and standard deviation
# We make a note of the actual sample size
N = len(df['BananaWeight'])
# We make a note of the variance
V = sigma**2
# We select some sample sizes, growing exponentially, to verify the law of large numbers
n = [1, 10,100,1000,10000]
# We select a large batch size
b = 10000
v = []
for num in n:
    mu_vec = np.zeros((b,1))
    for i in range(b):
        # We draw a vector of size num containing fake banana weights
        x = np.random.normal(mu, sigma, num) 
        # We evaluate the mean for this number of bananas (num)
        mu_vec[i,0] = x.mean() # We repeat this b times and fill a vector of b mu values
    # We compute for each num value the variance of the mu, over the b batches
    v =  np.append(v, mu_vec.var())  
def log_log_plt(n, v, N, V):
    # Plot a list of variance values as a function of a list of n values
    # Indicate in blue the actual value of N and the variance of mean predicted by the theory
    # Assume n[0]=1
    plt.plot(n, v)
    plt.plot(n, v, 'ro')
    plt.plot(N, V/N, 'bo')
    ax = plt.gca()
    ax.set_xscale('log')
    ax.set_yscale('log')
    plt.xlabel('Number examples $n$')
    plt.ylabel('Variance of mean $\sigma_{\mu}^2$')
    plt.text(n[0], v[0], '$\sigma^2_{\mu}=\sigma^2$', fontsize=18, color='r') # coordinate at origin
    plt.text(N, V/N, '$\sigma^2_{\mu}=\sigma^2/N$', fontsize=18, color='b') # coordinate of original N value
    plt.title('Variation of $\sigma^2_{\mu}$ as a function of $n$, for $\sigma^2=$%.0f and N=%d' % (V, N))
    
log_log_plt(n, v, N, V)

We verify that the points are well aligned, with slope (-1) on the log-log scale, in accordance with:

\(\sigma_{\mu}^2 = \sigma^2/n\), i.e., \(\log \sigma_{\mu}^2 = \log \sigma^2 - \log n\)

The intercept (coordonnée à l'origine) is \(\log \sigma^2\). Hence it suffices for us to know (an empirical estimate of) \(\sigma^2\) to draw this entire line, once we know the “theory”. In what follows, we call \(\sigma\) the standard deviation estimated with our empirical sample and we assume that \(\sigma_{\mu}\) is computed as \(\sigma_{\mu} = \sigma/\sqrt{n}\), rather than being empirically estimated with \(b\) samples of the same size \(n\) (which would not make sense, because if we had \(bn\) samples, we would use all of them to get a more precise estimation of the mean).

The base of the log does not matter. Here for convenience we use log\(_{10}\).

Numerical precision and sample size

The question now is: How many examples do you need to get a given numerical precision?

There are several ways of defining numerical precision. In this lecture and for the next few weeks, we will define numerical precision as:

\(\nu = \log_{10}(\)snr\()\)

where snr is the signal to noise ratio:

snr\( = \hat{\mu} / \sigma_{\mu}\)

Now, if we use ONE-sigma error bars (68% confidence intervals for the Normal distribution), we will report results as:

Measurement = \(\hat{\mu} \pm \sigma_{\mu}\)

For example:

Measurement = \(1.00 \pm 0.01\)

In this case we will have \(\hat{\mu}/\sigma_{\mu}=100\), thus \(\nu=2\) and we see that \(\nu=2\) corresponds to the number of decimal places. If we have:

Measurement = \(123.000 \pm 0.123\)

then \(\hat{\mu}/\sigma_{\mu}=1000\), thus \(\nu=3\), and we will call that our numerical precision. We will keep only ONE non-zero digit in the error bar and write the result as:

Measurement = \(123.0 \pm 0.1\)

Yet another example:

Measurement = \(0.00318123 \pm 0.0005\)

then \(\hat{\mu}/\sigma_{\mu}=6.36\), thus \(\nu=0.8\), and we will write the result as:

Measurement = \(0.0032 \pm 0.0005\)

We see that we have one non-zero digit in \(\hat{\mu}\) above the error digit, hence one significant digit, consistent with the fact that round(\(\nu\))=1. Note that we rounded the last digit of \(\hat{\mu}\).
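To illustrate these rounding conventions, here is a small helper (a hypothetical function, not part of the course material) that keeps one non-zero digit in the error bar and rounds the mean to the same decimal place:

def format_measurement(mu_hat, sigma_mu):
    # Hypothetical helper: keep ONE non-zero digit in the error bar and
    # round the mean to the same decimal place
    ndec = -int(np.floor(np.log10(abs(sigma_mu))))   # decimal place of the leading error digit
    return '%.*f +- %.*f' % (max(ndec, 0), round(mu_hat, ndec), max(ndec, 0), round(sigma_mu, ndec))

print(format_measurement(123.000, 0.123))      # 123.0 +- 0.1
print(format_measurement(0.00318123, 0.0005))  # 0.0032 +- 0.0005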

Solving for \(n\), we get:

\(n = 10^{2\nu}~\sigma^2/\hat{\mu}^2\)

Or equivalently:

\(\log_{10} n = 2\nu + \log_{10}(\sigma^2/\hat{\mu}^2)\)

So given empirical estimates of the mean and the variance of the quantity of interest, we can compute the number of examples \(n\) we need to achieve a given precision \(\nu\). Unfortunately, \(n\) increases exponentially with \(\nu\). Note that technically \(n\) should be rounded to get an integer value.

# (1) This part is similar to before, except that we only give the first and the last point of the line,
# the first one corresponding to (n=1, sigma^2_mu=sigma^2) and the last to (n=n_max, sigma^2_mu=sigma^2/n_max)
# We do not compute sigma^2_mu empirically, we just apply the formula sigma^2/n
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Find original value of n
N = len(df['BananaWeight'])
# Compute variance of the mean for the original sample number N
var_n = V/N
# Compute the original precision
nu_orig = np.round(np.log10(mu/np.sqrt(var_n)))
# Provide the original precision
print('Using N = %d examples, mean(BananaWeight) = %d +- %d (nu = %d)' % (N, np.round(mu), np.round(np.sqrt(var_n)), nu_orig))
# Compute variance of the mean for a vary large value of n (n_max)
n_max = 10000
var_n_max = V/n_max
# Make the log-log plot
log_log_plt([1, n_max], [V, var_n_max], N, V)

# (2) This part is new. We want to impose a larger number of significant digits.
# We choose a precision nu and compute the number of examples n_nu necessary to get that precision
# Choose precision
nu = 3 # Three significant digits
# Get var_n_nu = mu^2 / 10^2nu
var_new_n = (mu/10**nu)**2
# Get new n value at intersection
new_n = np.round(V / var_new_n)
# Plot horizontal line at var_n_new
plt.plot([1, new_n], [var_new_n, var_new_n])
print(var_new_n, V/new_n, np.sqrt(var_new_n))
print('Using N*= %d examples, mean(BananaWeight) = %5.1f +- %.1f (nu = %d)' % (new_n, mu, np.sqrt(var_new_n), nu))
# Plot vertical bar and show tex for new N value N*
plt.plot([new_n, new_n], [var_new_n, var_n_max], "--", color='black')
plt.plot(new_n, var_new_n, 'ko')
plt.text(new_n, var_new_n, '$\sigma^2_{\mu}=\sigma^2/N*$', fontsize=18, color='k') # coordinate of new N value
plt.text(N, var_new_n, '$\sigma^2_{\mu}=\hat{\mu}^2/10^{2\\nu}$', fontsize=18, color='k') # coordinate of new N value
plt.text(new_n/1.5, var_n_max/4, '$N*$', fontsize=18, color='k') # Desired value of N*
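# (3) This part plots the number of examples n needed as a function of the precision nu,
#     using the relation log10(n) = 2*nu + log10(sigma^2/mu^2) derived above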
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Compute b
b = np.log10(V) - 2* np.log10(mu)
# Choose first and last point
min_nu = 0
max_nu = 3
min_logn = 2*min_nu +b
max_logn = 2*max_nu +b
min_n = 10**min_logn
max_n = 10**max_logn
# Plot n on a log scale
plt.plot([min_nu, max_nu], [min_n, max_n], "-", color='black')
ax = plt.gca()
ax.set_yscale('log')
plt.xlabel('Precision $\\nu$')
plt.ylabel('Number examples $n$')
# Indicate the precision of the original sample
N = len(df['BananaWeight'])
var_n = V/N
nu_orig = np.log10(mu/np.sqrt(var_n))
plt.plot(nu_orig, N, 'bo')

Bottom line: the number of examples increases exponentially with the precision.

Scaling the number of examples with the confidence

As indicated before, the size of the confidence interval depends on the degree of confidence we want to put into it. According to Wikipedia: “In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively.” In mathematical form, this means:

Proba\(( |X-\mu| \leq \sigma ) \simeq 0.6827\)

Proba\(( |X-\mu| \leq 2 \sigma ) \simeq 0.9545\)

Proba\(( |X-\mu| \leq 3 \sigma ) \simeq 0.9973\)

Even though not all distributions of interest are “normal”, many are nearly normal and it is usual, instead of reporting confidence intervals as:

Proba\(( |X-\mu| \leq \epsilon ) \simeq (1 - \delta)\)

where \(\delta\) is a small value, one only considers for \(\epsilon\) multiples of \(\sigma\): \(1~\sigma\), \(2~\sigma\), \(3~\sigma\), \(\dots k~\sigma\), and one does not even bother reporting the exact corresponding value of \(\delta\).

One question of interest, though, is how the number of examples needed to ensure a given confidence interval grows with \(\delta\) (or, more simply, with the number \(k\) in \(\epsilon=k~\sigma\)).

This is easily settled. Earlier on, we defined the signal to noise ratio as snr\( = \mu / \sigma_{\mu}\). We indicated that this implicitly corresponds to a \(1~\sigma\) error bar: Measurement = \(\hat{\mu} \pm \sigma_{\mu}\). We now consider:

Measurement = \(\hat{\mu} \pm \epsilon_{\mu} = \hat{\mu} \pm k~\sigma_{\mu}\).

We re-define (more properly) the signal to noise ratio as:

snr\( = \mu / \epsilon_{\mu}\)

where \(\epsilon_{\mu} = k~\sigma_{\mu}\).

Thus \(\sigma_{\mu}\) can be expressed as:

\(\sigma_{\mu} = \mu / (k \)snr).

Applying the same definition as before of numerical precision, \(\nu = \log_{10}\) snr, we also get:

\(\sigma_{\mu} = \mu / (k 10^\nu)\).

And now, we can again apply the formula \(\sigma_{\mu} = \sigma / \sqrt{n}\) and solve for \(n\) to get:

\(n = k^2~10^{2\nu}~\sigma^2/\mu^2\)

Bottom line: \(n\) grows only with the square of \(k\), but it grows exponentially with \(\nu\).

Let us define:

\(n_{11}=\sigma^2 / \mu^2\)

We have the following table of growth of \(n\) with the number of significant digits \(\nu\) and the number of sigmas \(k\) in the error bar:

|  | \(\nu=0\) | \(\nu=1~digit\) | \(\nu=2~digits\) | \(\nu=3~digits\) |
| --- | --- | --- | --- | --- |
| \(\epsilon= 1~\sigma_{\mu}\) | \(n_{11}\) | \(100~n_{11}\) | \(10000~n_{11}\) | \(10^6~n_{11}\) |
| \(\epsilon= 2~\sigma_{\mu}\) | \(4~n_{11}\) | \(400~n_{11}\) | \(40000~n_{11}\) | \(4\cdot 10^6~n_{11}\) |
| \(\epsilon= 3~\sigma_{\mu}\) | \(9~n_{11}\) | \(900~n_{11}\) | \(90000~n_{11}\) | \(9\cdot 10^6~n_{11}\) |
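The entries of this table follow directly from \(n = k^2~10^{2\nu}~n_{11}\); here is a quick sketch to regenerate them:

for k in [1, 2, 3]:
    row = ['%g n11' % (k**2 * 10**(2*nu)) for nu in range(4)]
    print('epsilon = %d sigma_mu:' % k, row)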

We typically recommend using:

\(n=100~\sigma^2 / \mu^2\)

which corresponds to a \(1~\sigma_{\mu}\) confidence interval only and \(\nu=1\) significant digit, which is the least we can do to get decent error bars. In the particular case of the Bernoulli distribution with parameter \(p\), we have \(\mu = p\) and \(\sigma^2 = p(1-p)\). Thus, if \(p \ll 1\), we find again:

\(n \simeq 100 /p\)
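We can test this rule of thumb with a quick simulation (a minimal sketch using the bernoulli import from the top of the notebook; the value p = 0.01 is just an illustrative choice):

p = 0.01                              # illustrative error rate (an assumption, not from the lecture data)
n = int(100 / p)                      # recommended sample size n = 100/p
np.random.seed(0)
x = bernoulli.rvs(p, size=n)          # simulate n coin flips (errors)
sigma_mu = np.sqrt(p * (1 - p) / n)   # standard error of the mean
print('p_hat = %.3f +- %.3f (true p = %.3f)' % (x.mean(), sigma_mu, p))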

Bootstrap resampling

In the previous section, we found a simple scaling law for the number of examples needed to get a good precision on the mean. However, we made the assumption that the sample mean was (nearly) distributed according to the Normal law. This is actually not so bad if the samples are drawn from the Bernoulli distribution, because the sum of Bernoulli random variables is distributed according to the Binomial law, which can be approximated by the Normal law for (relatively) large values of \(n\). But consider again the distribution of final exam grades:

gdf[col].hist(density=True)
plt.xlabel(col)
plt.ylabel('Density')

It has a “weird” distribution. How could we know whether the distribution of its mean values (over different groups of students, similarly distributed) would be “Normal”? It is not obvious a priori. The idea of the bootstrap is to sample \(n\) examples (with replacement) from the \(n\) empirical values we have, to obtain a “new” bootstrap sample whose distribution is similar to that of our original sample. We do that \(b\) times to get \(b\) bootstrap samples, each of which has \(n\) examples (with some repetitions). For each sample of size \(n\), we can calculate its mean and see how these means are distributed. Let's give it a try!

def mean_bootstrap_samples(df, b=100000):
    # take a df series and sample n values with replacement b times
    # return a vector of mean values over the b bootstraps
    n = len(df)
    DF = np.random.choice(df, size =(n, b))
    MU = np.mean(DF, axis=0)
    return MU
bs = mean_bootstrap_samples(gdf[col])
mu, sigma = bs.mean(), bs.std()                           # mean and standard deviation
count, bins, ignored = plt.hist(bs, 30, density=True)     # make the histogram
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *           # plt the normal density
            np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
            linewidth=2, color='r')
plt.xlabel("Mean " + col)
plt.ylabel('Density')

Moral of the story: the density of the mean really looks like the Normal density.
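Since the bootstrap means look (approximately) Normal, we can, for instance, report a 95% confidence interval for the mean grade either as \(\hat{\mu} \pm 2\sigma_{\mu}\) or directly from the 2.5th and 97.5th percentiles of the bootstrap means (a quick sketch reusing the bs, mu, and sigma computed above):

lower, upper = np.percentile(bs, [2.5, 97.5])                        # bootstrap percentile interval
print('95%% CI from percentiles:   [%.2f, %.2f]' % (lower, upper))
print('95%% CI from mu +- 2 sigma: [%.2f, %.2f]' % (mu - 2*sigma, mu + 2*sigma))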