--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.10.3 kernelspec: display_name: Python 3 language: python name: python3 --- # Lecture 1: Crash course in statistics We try to avoid frustrating you with too much probability and statistics. Concentrate to get the basics, so we can quickly get to the more fun stuff! What you need to absolutely know is in this notebook: 1. Mean (moyenne), range, variance, standard deviation (écart type) 1. Median, quartiles, percentiles (or quantiles) 1. Scatter plots, covariance and correlation 1. Random variable, Histograms, Probability Density Function (pdf) 1. Pair plots 1. Normal distribution and confidence intervals (error bars) 1. Variance of the mean, standard error, numerical precision, and sample size {code-cell} ipython3 import numpy as np # Library useful to do matrix algebra import pandas as pd # Library useful to manipulate data tables (data frames) and visualize data import seaborn as sns; sns.set() # Note useful data visualization functions from scipy.stats import bernoulli# Errors made are like coin flips, the follow a Bernoulli distribution import matplotlib.pyplot as plt # Library to make graphs import matplotlib.pylab as pylab # Set font sizes as desired import matplotlib.ticker as ticker params = {'legend.fontsize': 'x-large', 'figure.figsize': (20, 10), 'axes.labelsize': 'xx-large', 'axes.titlesize':'xx-large', 'xtick.labelsize':'xx-large', 'ytick.labelsize':'xx-large'} pylab.rcParams.update(params) # Command to insert the graphs in line in the notebook: %matplotlib inline from platform import python_version print(python_version()) # Check which version of Python you are running  ## 0. Load and visualize data To get some concrete real example, we weighted 10 apples and 10 bananas (in grams). We also measured the length of the 10 bananas (in cm). Note that for a given index, there is no relation between the apple weight and the banana weight, but the banana weight and length are measured for the same banana. We load the data as a Pandas dataframe. {code-cell} ipython3 df = pd.read_csv("AppleBanana.csv")  We use a nice feature of Pandas that allows us to visualize tables as a heat map. It appears immediately visually that AppleWeight and BananaWeight are not correlated (as expected), but BananaWeight and BananaLength seem to be correlated. Another observation is that there are two clusters of 5 apples, the first one is be composed of lighter apples than the second one. {code-cell} ipython3 df.style.background_gradient(cmap='Blues')  If we use the function describe() of Pandas, we get a bunch of statistics, le us see whether we understand what those are. {code-cell} ipython3 df.describe()  ## 1. Mean, variance, and standard deviation We are going the use the banana weights to illustrate the notions of Mean (moyenne), variance, and standard deviation (écart type). First comes first: **we VISUALIZE the weights of the bananas on a line.** {code-cell} ipython3 def show_series(df, series='BananaWeight'): SortedBananaWeight = df[series].sort_values() fig = plt.figure(figsize=(20, 1)) plt.plot(SortedBananaWeight, np.ones(len(SortedBananaWeight)), 'go') plt.xlabel('%s' % series) ax = plt.gca() ax.axes.yaxis.set_visible(False) tick_spacing = 1 ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing)) show_series(df)  ### Mean: In mathematical notations, if the weights of the bananas are denoted: $x_1 = 149, x_2 = 154, \cdots, x_n = 151$, the mean will be: $\mu = \frac{1}{10} (x_1 + x_2 + \cdots + x_n) ~~= ~~\frac{1}{n}\sum_{i=1}^n x_i$ that is: $\mu = \frac{1}{10} (149 + 154 + \cdots + 151)$ Let us check that and verify the we obtain the same value as provided by df.describe(). {code-cell} ipython3 mu = (1/10) * df['BananaWeight'].sum() assert(mu== df['BananaWeight'].mean()) mu  We can now place the mean as a red vertical bar in the plot we made previously. {code-cell} ipython3 show_series(df) plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red') plt.title("Mean = red vertical bar")  ### Range, variance and standard deviation We now need a measure that summarizes the dispersion of points (voir **[indicateurs de dispersion](https://fr.wikipedia.org/wiki/Indicateur_de_dispersion)**). We could use the **range** (en français: étendue). The range is define as the difference between the maximum (largest) and the minimum (smallest) values: rng$= \max_i(x_i) - \min_i(x_i) = 166-142 = 24$ {code-cell} ipython3 rng = df['BananaWeight'].max()-df['BananaWeight'].min() rng  Another possibility would be to compute the **[mean absolute deviation](https://en.wikipedia.org/wiki/Average_absolute_deviation)** (MAD) (en français: écart moyen) defined as: mad$= \frac{1}{n} \sum_{i=1}^n |x_i - \mu|$ that is the average of the distances between the $b_i$ values and the mean $\mu$. Notice that if you omitted the absolute value, you would get: $\frac{1}{n} \sum_{i=1}^n (x_i - \mu) = \frac{1}{n} \sum_{i=1}^n x_i - \frac{1}{n} n \mu = \mu - \mu = 0$, which is not very useful. {code-cell} ipython3 mad = df['BananaWeight'].mad() mad  But more commonly, people use the **[standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)** (std ou stdev) (en français: écart type), which is the square root of the **[variance](https://en.wikipedia.org/wiki/Variance)** (en français: variance). The variance is defined as: var$~ = \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$. So the only difference between mad$_i$ and var$_b$ is that we SQUARE the differences $(x_i - \mu)$ instead of taking their absolute values. Then to get the standard deviation, on just needs to take the square root: std$~ = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$. {code-cell} ipython3 var = df['BananaWeight'].var() var  {code-cell} ipython3 std = df['BananaWeight'].std() std  {code-cell} ipython3 assert(std==np.sqrt(var))  Let us plot now as dashed red vertical bars $(\mu - \sigma)$ and $(\mu + \sigma)$: {code-cell} ipython3 def show_mean_std(df, series = 'BananaWeight', two_sigma=False): dfb = df[series] mu = dfb.mean() std = dfb.std() plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red') plt.axvline(mu-std, alpha = 1, ymax = 1, linestyle = '--', color='red') plt.axvline(mu+std, alpha = 1, ymax = 1, linestyle = '--', color='red') if two_sigma: plt.axvline(mu-2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue') plt.axvline(mu+2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue') show_series(df) show_mean_std(df) plt.title("(mean + std) and (mean - std) = red dashed vertical bars")  ## 2. Median, quartiles, percentiles (or quantiles) So far we have explained in df.describe() mean and std. We also found that min and max are the smallest and largest values (respectively) and that they are used to compute the range rng=max-min. Next we need to explain the lines marqued as 25\%, 50\%, and 75\%. These are called **[percentiles](https://en.wikipedia.org/wiki/Percentile)**. Percentiles are expressed in percent (beween 0 and 100) while quantiles are expressed in probability values (between 0 and 1); they are otherwise the same notion: There are some special cases pf percentiles that have particular names: The **median** splits the examples in two halves with the same number of examples on either side. The **quartiles** split the examples in 4 equal parts. {code-cell} ipython3 df.describe()  Let us plot now: * as a solid green vertical bars the median, * as dashed green vertical bars the upper and lower quartiles, * as dashed black vertical bars the $5^{th}$ and $95^{th}$ percentiles. {code-cell} ipython3 def show_percentiles(df, series = 'BananaWeight', tail=0.10): dfb = df[series] # Calculate percentiles quant_L, quant_25, quant_50, quant_75, quant_U = dfb.quantile(tail), dfb.quantile(0.25), dfb.quantile(0.5), dfb.quantile(0.75), dfb.quantile(1-tail) print('QL=%5.2f, Q25=%5.2f, Q50=%5.2f, Q75=%5.2f, QU=%5.2f,' % (quant_L, quant_25, quant_50, quant_75, quant_U)) # Plot the green lines plt.axvline(quant_25, alpha = 1, ymax = 1, linestyle = '--', color='green') plt.axvline(quant_50, alpha = 1, ymax = 1, linestyle = '-', color='green') plt.axvline(quant_75, alpha = 1, ymax = 1, linestyle = '--', color='green') # Plot the black lines plt.axvline(quant_L, alpha = 1, ymax = 1, linestyle = '--', color='black') plt.axvline(quant_U, alpha = 1, ymax = 1, linestyle = '--', color='black') show_series(df) show_percentiles(df) plt.title("Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")  First we can check that the potition of the 25, 50, and 75 percentiles in describe() coincide with those we found using quantile(). Second, we can be surprised of where the percentile threshold lines end up. If you read the [Pandas documentation on quantiles](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html), this is due to the fact that, by default, some linear interpolation is performed. You can play with the various options to see how the position of the vertical lines is affected. +++ Let us now show mean and std and percentiles on top of each other. {code-cell} ipython3 show_series(df) show_mean_std(df) show_percentiles(df) plt.title("Mean, std and percentiles overlayed.")  ## 3. Scatter plots, covariance, and correlation In a data frame, a column is called a "series". It can be interesting to plot one series vs. another. Let us plot BananaLength vs. BananaWeight, and compare it with AppleWeight vs. BananaWeight. {code-cell} ipython3 fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7)) df.plot(x='BananaWeight', y='BananaLength', kind="scatter", ax=ax1, color='g', s = 50) df.plot(x='BananaWeight', y='AppleWeight', kind="scatter", ax=ax2, color='r', s = 50)  In the left plot, the points do no look super well aligned, still there seems a trend: the longer the banana, the heavier. In the right plot, as can be expected, there is no trend. **We can actually QUANTIFY that by measuring how the series co-vary (vary together):** +++ ### Covariance Remember, we defined variance as the Mean Square Distance to the Mean, i.e. var$= \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)^2$. By analogy with the variance, we define the **[co-variance](https://en.wikipedia.org/wiki/Covariance)** as: So given two series $\{x_i\}$ and $\{y_i\}$, the **covariance** is defined as: $\sigma_{xy} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)$. We show below the **covariance matrix**, i.e. the matrix of all covariances of pairs of series in a data frame. On the diagonal, we have $\sigma^2$, the variance. Off diagonal, we have the covariances $\sigma_{xy}$. Check that the variance of BananaWeight is the same as found above. {code-cell} ipython3 df.cov()  ### Correlation The **[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)** is a normalized version of the covariance (**so it solves the above problem of scale**). Mathematically: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ The magnitude of $\rho_{xy}$ represents the strength of the linear relation between the two series. Its values are between -1 and 1. Zero means no correlation, 1 means perfect correlation (points aligned, positive slope) and -1 means perfect anti-correlation (points aligned, negative slope). The **correlation matrix** is the matrix of all correlations of pairs of series in a data frame. On the diagonal, we have $\rho_{xx}$ values, which are always 1 (as expected from the definition of the correlation coefficient). {code-cell} ipython3 df.corr()  We are now reassured: we see that correlation(AppleWeight, BananaWeight) < correlation(BananaLength, BananaWeight), which is what we expected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related. To visualize the correlation matrix, we can tint the boxes and turn it into a heat map: {code-cell} ipython3 df.corr().style.background_gradient(cmap='Blues')  Unfortunately, as can be seen, the coloring of boxes for the same numerical value is different in the different columns. This is due to the fact that Pandas spans the color map between the smallest and the largest value in each column. To overcome this problem, we can visualize the correlation matrix with Seaborn: {code-cell} ipython3 sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True)  ## 4. Random variables, histograms, and PDF Consider a quantity like BananaWeight. We can consider the production of bananas as a random process. Given a very large number of banana, if we pick one at random, it will have a given weight. Each banana has a slightly different weight. We call BananaWeight a **random variable** (variable aléatoire). Random variables are often denoted with **capital letters**, such as $X$, and particular measurements or observations (called "realizations") are denoted with small letters, such as $x$. If we draw a very large number of bananas, the plots we made in section 1 (dots aligned on one axis), will become impossible to read (the dots will fall on one another). We will need another means of visualizing data. We are going to introduce **histograms**. This is a way of representing as a bar graph Proba($X \in bin$). If the number of examples grows to infinity and, simultaneously, the width of the bins goes to zero, a histogram becomes a density $\rho($X = x$)$. +++ ### Grades info 111 As a first real life example, let us make histgrams of the grades of info 111. We replaced missing values by 0. {code-cell} ipython3 gdf = pd.read_csv("GradesInfo111.csv") gdf.describe()  {code-cell} ipython3 gdf.style.background_gradient(cmap='Blues')  {code-cell} ipython3 col = gdf.columns sns.heatmap(gdf.sort_values(by=col).transpose())  Let us check one of the series, that of the third column: {code-cell} ipython3 show_series(gdf, series=col) show_mean_std(gdf, series=col) show_percentiles(gdf, series=col)  As can be seen, many points fall on top of one another. Hence it is difficult to visualize the distribution of grades in this way. Let us make a histogram. {code-cell} ipython3 gdf[col].hist() plt.xlabel(col) plt.ylabel('Counts') show_mean_std(gdf, series = col) show_percentiles(gdf, series = col) plt.title("Mean = solid red. Mean +- std = dashed red. Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.")  Now it is easier to see that there is a higher density of good grades, except for a cluster of people at very low grade (probably people who did not show up at the exam). +++ ### Multiple histograms Pandas allows you to easily draw simultaneously the histograms of all series. {code-cell} ipython3 gdf.hist(figsize=(20, 20))  ## 5. Pair plots When we have several series, it is convenient to show simultaneously histograms and scatter plots. Be patient, this is SLOW. On the diagonal you will see the histograms for the various series. Off diagonal you will see the scatter plots of all pairs of series. {code-cell} ipython3 sns.pairplot(gdf)  As you can see, several series are very correlated (points nearly aligned). We can quantify that with the correlation matrix. {code-cell} ipython3 sns.heatmap(gdf.corr(), fmt='0.2f', annot=True, square=True)  ## 6. Normal Law and confidence intervals ### Normal "Law" ("Loi" Normale) Let us go back to our banana example. We were too lazy to weigh a large number of bananas. So we decided to simulate drawing 1000 examples of BananaWeight using the Normal law, but **keeping the mean and standard deviation of our (small) real sample**. Formally, we say the BananaWeight is a random variable distributed according to the **[Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution)** or **Gaussian distribution**. {code-cell} ipython3 def draw_normal_bananas(df, series = 'BananaWeight', num=1000): # We draw num examples using the Normal law, using the mean and std of df[series] # By fixing the ramdom seed, we get the same sample every time we run the code np.random.seed(seed=1234) # scipy.stats uses numpy.random to generate its random numbers mu, sigma = df[series].mean(), df[series].std() # mean and standard deviation s = np.random.normal(mu, sigma, num) # draw 1000 bananas count, bins, ignored = plt.hist(s, 30, density=True) # make the histogram plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plot the Normal density np.exp( - (bins - mu)**2 / (2 * sigma**2) ), linewidth=2, color='r') plt.xlabel(series) plt.ylabel('Density') return pd.DataFrame(s, columns = [series]) draw_normal_bananas(df)  The red line represents the **density** of bananas having a given weight. The blue bars represent the histogram obtained for 1000 bananas drawn randomly according the a Normal distribution with mean and standard deviation that are given by our emplirical sample of 10 bananas, used before. The height of the bars is proportional to the number of bananas that fall in the interval of weight defined by the width of the bar. +++ ### Mean, median, stdev, quartiles, and other quantiles, or percentiles As before, we can plot as vertical lines the mean variance the mean, median, stdev, quartiles, and other quantiles, or percentiles. But we chose different quantiles for the black dashed lines. The first one uses P=0.025=2.5% and the last one P=0.975=97.5%. In this way, we get 95% of the probability mass between the two black dashed lines. {code-cell} ipython3 df_normal = draw_normal_bananas(df) show_mean_std(df_normal, two_sigma=True) show_percentiles(df_normal, tail=0.025)  For the **[Normal distribution](https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2)**: * The mean and the median coincide. * We have beween $\mu - \sigma$ and $\mu - \sigma$ (red dashed lines) 68% of the probability mass. The is more than between the quartiles (green dashed lines): 50% of the probability mass, by definition. * We have beween $\mu - 2\sigma$ and $\mu - 2\sigma$ (blue dashed lines) 95% of the probability mass (this is verified by showing as black dashed lines the 2.5 percentile). * The Normal distribution is completely characterized by only two parameters: $\mu$ and $\sigma$ (sufficient statistics). +++ ### Confidence intervals If we want to give a an idea of a typical BananaWeight, we can just provide the mean (point estimation), or we can give an interval in which it is probable to find the weight of a typical banana. For the Normal distribution, we have: * Proba$(\mu-\sigma ERROR BAR: There are multiple definitions of error bars. When we write We mean Measurement =$\mu \pm \sigma$A 68% CI, for data Normally distributed. Proba$(\mu-\sigma
Measurement = $\mu \pm 2 \sigma$ A 95% CI, for data Normally distributed. Proba$(\mu-2\sigma Measurement =$[L, U]$, 95% CI$L$is the lower bound and$U$the upper bound of the 95% CI +++ To get values for the three series in our fruit example, while **respecting the covariance between the series**, we can draw examples using the Normal multivariate distribution, using the covariance matrix of our (small) real data sample. {code-cell} ipython3 s = np.random.multivariate_normal(df.mean(), df.cov(), 1000) df_multi = pd.DataFrame(s, columns = ['AppleWeight', 'BananaWeight', 'BananaLength']) sns.pairplot(df_multi)  We can then check that the correlation matrix of the articial data resembles that of the real data. {code-cell} ipython3 fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10)) sns.heatmap(df_multi.corr(), fmt='0.2f', annot=True, square=True, ax=ax1) ax1.set_title('Artificial data') sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True, ax=ax2) ax2.set_title('Real data')  ## 6. Variance of the mean, standard error, error bars, numerical precision +++ Often, what we are interested in is not the distribution of a quantity like BananaWeight, but we want to estimate as precisely as possible the true MEAN of the quantity (its mathematical expectation). The larger the sample size n, the more precisely we will estimate the mean. In fact, the **variance of the mean is inversely proportional to n**. ### Variance of the mean and standard error We call$\mu$the true mean and$\hat{\mu} = (1/n) (\sum_{i=1}^n X_i)$the estimator of the empirical mean (computed with only n examples). The standard deviation of the mean (also known as **standard error**) is obtained by taking the square root of the variance of the mean: The **[law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers)**, stipulates that as$n$goes to infinity the empirical mean$\hat{\mu}$converges to the "mathematical expectation"$\mu$, and the variance of the mean goes to zero. Taking as example the Normal law and the 95% confidence interval we saw above, we easily see that: Proba$(\mu - \sigma_{\mu} < \hat{\mu} < \mu - \sigma_{\mu}) = 0.95$Proba$(\mu - \sigma/\sqrt{n} < \hat{\mu} < \mu - \sigma/\sqrt{n}) = 0.95$Hence$\hat{\mu}$converges to$\mu$. {code-cell} ipython3 # For concreteness, we choose as mean and variance that of our empirical sample of banana weights mu, sigma = df['BananaWeight'].mean(), df['BananaWeight'].std() # mean and standard deviation # We make a note of the actual sample size N = len(df['BananaWeight']) # We make a note of the variance V = sigma**2 # We select some sample sizes, growing exponentially, to verify the law of large numbers n = [1, 10,100,1000,10000] # We select a large batch size b = 10000 v = [] for num in n: mu_vec = np.zeros((b,1)) for i in range(b): # We draw a vector of size num containing fake banana weights x = np.random.normal(mu, sigma, num) # We evaluate the mean for this number of bananas (num) mu_vec[i,0] = x.mean() # We repeat this b times and fill a vector of b mu values # We compute for each num value the variance of the mu, over the b batches v = np.append(v, mu_vec.var())  {code-cell} ipython3 def log_log_plt(n, v, N, V): # Plot a list of variance values as a function of a list of n values # Indicate in blue the actual value of N and the variance of mean predicted by the theory # Assume n=1 plt.plot(n, v) plt.plot(n, v, 'ro') plt.plot(N, V/N, 'bo') ax = plt.gca() ax.set_xscale('log') ax.set_yscale('log') plt.xlabel('Number examples$n$') plt.ylabel('Variance of mean$\sigma_{\mu}^2$') plt.text(n, v, '$\sigma^2_{\mu}=\sigma^2$', fontsize=18, color='r') # coordinate at origin plt.text(N, V/N, '$\sigma^2_{\mu}=\sigma^2/N$', fontsize=18, color='b') # coordinate of original N value plt.title('Variation of$\sigma^2_{\mu}$as a function of$n$, for$\sigma^2=%.0f and N=%d' % (V, N)) log_log_plt(n, v, N, V)  We verify that the points are well aligned, with slope (-1) in the log-log scale, in accordance with: The intercept (coordonnée à l'origine) is\log \sigma^2$. Hence is suffices for us to know (an empirical estimate of)$\sigma^2$to draw this entire curve, once we know the "theory". In what follows we call$\sigma$the standard deviation estimated with our empirical sample and we assume that$\sigma_{\mu}$is computed with$\sigma_{\mu} = \sigma/\sqrt{n}$, rather than being empirically estimated with$b$samples of the same size$n$(which would not make sense, because if we had$bn$samples, we would use all of them to get a more precise estimation of the mean. The base of the log does not matter. Here for convenience we use log$_{10}$. +++ ### Numerical precision and sample size The question now is: **How many examples do you need to get a given numerical precision**? There are several ways of defining [numerical precision](http://web.mnstate.edu/marasing/CHEM380/Documents/Statistics%20I%202017.pdf). In this lecture and for the next few weeks, we will define numerical precision as: where snr is the signal to noise ratio: Now, if we use ONE-sigma bars (68\% condidence intervals for the Normal distribution) we will report results as: Measurement =$\hat{\mu} \pm \sigma_{\mu}$For example: Measurement =$1.00 \pm 0.01$In this case we will have$\hat{\mu}/\sigma_{\mu}=100$, thus$\nu=2$and we see that$\nu=2$corresponds to the number of decimal places. If we have: Measurement =$123.000 \pm 0.123$then,$\hat{\mu}/\sigma_{\mu}=1000$, thus$\nu=3$, and we will call that our numerical precision. We will keep only ONE trailing non zero number in the error bar and write the result as: Measurement =$123.0 \pm 0.1$Yet another example: Measurement =$0.00318123 \pm 0.0005$then,$\hat{\mu}/\sigma_{\mu}=6.36$, thus$\nu=0.8$, we will write the result as: Measurement =$0.0032 \pm 0.0005$We sse that we have one non zero digit in$\hat{\mu}$above the error digit, hence one significant digit, consistent with the fact that round($\nu$)=1. Note that we rounded the last digit of$\hat{\mu}$. Solving for n, we get: Or equivalently: So given empirical estimates of the mean and the variance of the quantity of interest, we can compute the number of examples$n$we need to achieve a given precision$\nu$. Unfortunately, **$n$increases exponentially with$\nu$.** Note that technically$n$should be rounded to get an integer value. +++ {code-cell} ipython3 # (1) This part is similar as before, except that we only give the 1st and the last point of the line # the first one corrsponding to (n=1, sigma^2_mu=sigma^2) and the last (n=n_max, sigma^2_mu=sigma^2/n_max) # We do not compute sigma^2_mu empirically, we just apply the formula sigma^2/n # Compute the empirical mean mu = df['BananaWeight'].mean() # Compute the empirical variance V = df['BananaWeight'].var() # Find original value of n N = len(df['BananaWeight']) # Compute variance of the mean for the original sample number N var_n = V/N # Compute the original precision nu_orig = np.round(np.log10(mu/np.sqrt(var_n))) # Provide the original precision print('Using N = %d examples, mean(BananaWeight) = %d +- %d (nu = %d)' % (N, np.round(mu), np.round(np.sqrt(var_n)), nu_orig)) # Compute variance of the mean for a vary large value of n (n_max) n_max = 10000 var_n_max = V/n_max # Make the log-log plot log_log_plt([1, n_max], [V, var_n_max], N, V) # (2) This part is new. We want to impose a larger number of significant digits. # We choose a precision nu and compute the number of examples n_nu necessary to get that precision # Choose precision nu = 3 # Three significant digits # Get var_n_nu = mu^2 / 10^2nu var_new_n = (mu/10**nu)**2 # Get new n value at intersection new_n = np.round((var/var_new_n)) # Plot horizontal line at var_n_new plt.plot([1, new_n], [var_new_n, var_new_n]) print(var_new_n, V/new_n, np.sqrt(var_new_n)) print('Using N*= %d examples, mean(BananaWeight) = %5.1f +- %.1f (nu = %d)' % (new_n, mu, np.sqrt(var_new_n), nu)) # Plot vertical bar and show tex for new N value N* plt.plot([new_n, new_n], [var_new_n, var_n_max], "--", color='black') plt.plot(new_n, var_new_n, 'ko') plt.text(new_n, var_new_n, '$\sigma^2_{\mu}=\sigma^2/N*$', fontsize=18, color='k') # coordinate of new N value plt.text(N, var_new_n, '$\sigma^2_{\mu}=\hat{\mu}^2/10^{2\\nu}$', fontsize=18, color='k') # coordinate of new N value plt.text(new_n/1.5, var_n_max/4, '$N*$', fontsize=18, color='k') # Desired value of N*  {code-cell} ipython3 # Compute the empirical mean mu = df['BananaWeight'].mean() # Compute the empirical variance V = df['BananaWeight'].var() # Compute b b = np.log10(V) - 2* np.log10(mu) # Choose first and last point min_nu = 0 max_nu = 3 min_logn = 2*min_nu +b max_logn = 2*max_nu +b min_n = 10**min_logn max_n = 10**max_logn # Plot n on a log scale plt.plot([min_nu, max_nu], [min_n, max_n], "-", color='black') ax = plt.gca() ax.set_yscale('log') plt.xlabel('Precision$\\nu$') plt.ylabel('Number examples$n$') # Indicate the precision of the original sample N = len(df['BananaWeight']) var_n = V/N nu_orig = np.log10(mu/np.sqrt(var_n)) plt.plot(nu_orig, N, 'bo')  Bottomline: the number of examples increases exponetially with the precision. +++ ### Scaling the number of examples with the confidence As indicated before, the size confidence interval depends on the degree of confidence we want to put into it. According to [Wikipedia](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule): "In statistics, the **68–95–99.7 rule**, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a **normal distribution** with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within **one, two and three standard deviations of the mean**, respectively." In mathematical form, this means: Proba$( |X-\mu| \leq \sigma ) \simeq 0.6827 Proba$( |X-\mu| \leq 2 \sigma ) \simeq 0.9545$ Proba$( |X-\mu| \leq 3 \sigma ) \simeq 0.9973$ Even though not all distributions of interest are "normal", many are nearly normal and it is usual, instead of reporting confidence intervals as: Proba$( |X-\mu| \leq \epsilon ) \simeq (1 - \delta)$ where $\delta$ is a small value, one only considers for $\epsilon$ multiple values of $\sigma$: $1~\sigma$, $2~\sigma$, $3~\sigma$, $\dots k~\sigma$, and one does not even bother reporting the exact corresponding value of $\delta$. One question of interest though is how does the number of examples needed to ensure a given confidence interval grow with $\delta$ (or more simply with the number $k$ in $\epsilon=k~\sigma$. This is easily settled. Earlier on, we defined the **signal to noise ratio** as snr$= \mu / \sigma_{\mu}$. We indicated that this implicitly corresponds to a $1~\sigma$ error bar: Measurement = $\hat{\mu} \pm \sigma_{\mu}$. We now consider: Measurement = $\hat{\mu} \pm \epsilon_{\mu} = \hat{\mu} \pm k~\sigma_{\mu}$. We re-define (more properly) the signal to noise ratio as: snr$= \mu / \epsilon_{\mu}$ where $\epsilon_{\mu} = k~\sigma_{\mu}$. Thus $\sigma_{\mu}$ can be expressed as: $\sigma_{\mu} = \mu / (k$snr). Applying the same defintion as before of numerical precision $\nu = \log_{10}$ snr, we also get: $\sigma_{\mu} = \mu / (k 10^\nu)$. And now, we can again apply the formula $\sigma_{\mu} = \sigma / \sqrt{n}$ and solve for $n$ to get: Bottomline: **$n$ grows only with the square of $k$, but it grows exponentially with $\nu$**. +++ Let us pose: $n_{11}=\sigma^2 / \mu^2$ We have the following table of growth of $n$ with the number of significant digits $\nu=$ and the number of sigmas in the error bar $k$: | | $\nu=0$ |$\nu=1~digit$ | $\nu=2~digits$ | $\nu=3~digits$ | | :--- | :----: | :----: | :----: | ---: | | $\epsilon= 1~\sigma_{\mu}$ | $~~~~~~n_{11}$ | $100~n_{11}$ | $10000~n_{11}$ | $10^6~n_{11}$ | $\epsilon= 2~\sigma_{\mu}$ | $~~~~4~n_{11}$| $400~n_{11}$ | $40000~n_{11}$ | $4~10^6~n_{11}$ | | $\epsilon= 3~\sigma_{\mu}$ | $~~~~9~n_{11}$ | $900~n_{11}$ | $90000~n_{11}$ | $9~10^6~n_{11}$ We typically recommend to use: $n=100~\sigma^2 / \mu^2$ which corresponds to only $1~\sigma_{\mu}$ confidence interval and $\nu=1$ significant digit, which is the least we can do to get descent error bars. In the particular case of the Bernouilli distribution with parameter $p$, we have $\mu = p$ and $\sigma^2 = p(1-p)$. Thus if $p \ll 1$, we find again: $n \simeq 100 /p$ +++ ### Bootstrap resampling In the previous we found a simple scaling law for the number of examples needed to get a good precision of the mean. However, we made the assumption that the sample mean was (nearly) distributed with the Normal law. This is actually not so bad if the sample are drawn from the Bernoulli distribution, because the sum of Bernouilli random variables is distributed according to the Binomial law, which can be approximated with the Normal law for (relatively) large values of $n$. But consider again the distribution of final exam grades: {code-cell} ipython3 gdf[col].hist(density=True) plt.xlabel(col) plt.ylabel('Density')  It has a "wierd" distribution. How could we know whether the distribution of its mean values (for different students, but similarly distributed) would be "Normal"? it is probably not normal. The idea othe the bootstrap is to sample $n$ times (with replacement) examples from the $n$ empirical values we have to obtain a "new" bootstrap sample, whose distribution is similar to that of our original sample. We do that $b$ to get $b$ bootstrap samples. Each of them has $n$ examples (with some repetitions). For each sample of size $n$, we can calculate its mean and see how it is distributed. Let's give it a try! {code-cell} ipython3 def mean_bootstrap_samples(df, b=100000): # take a df series and sample n values with replacement b times # return a vector of mean values over the b bootstraps n = len(df) DF = np.random.choice(df, size =(n, b)) MU = np.mean(DF, axis=0) return MU  {code-cell} ipython3 bs = mean_bootstrap_samples(gdf[col])  {code-cell} ipython3 mu, sigma = bs.mean(), bs.std() # mean and standard deviation count, bins, ignored = plt.hist(bs, 30, density=True) # make the histogram plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plt the normal density np.exp( - (bins - mu)**2 / (2 * sigma**2) ), linewidth=2, color='r') plt.xlabel("Mean " + col) plt.ylabel('Density')  Morality: the density of the mean really looks like the Normal density. {code-cell} ipython3