{ "cells": [ { "cell_type": "markdown", "id": "governmental-avatar", "metadata": {}, "source": [ "# Lecture 1: Crash course in statistics\n", "\n", "We try to avoid frustrating you with too much probability and statistics. Concentrate to get the basics, so we can quickly get to the more fun stuff! What you need to absolutely know is in this notebook:\n", "1. Mean (moyenne), range, variance, standard deviation (écart type)\n", "1. Median, quartiles, percentiles (or quantiles)\n", "1. Scatter plots, covariance and correlation\n", "1. Random variable, Histograms, Probability Density Function (pdf)\n", "1. Pair plots\n", "1. Normal distribution and confidence intervals (error bars)\n", "1. Variance of the mean, standard error, numerical precision, and sample size" ] }, { "cell_type": "code", "execution_count": null, "id": "realistic-audio", "metadata": {}, "outputs": [], "source": [ "import numpy as np # Library useful to do matrix algebra\n", "import pandas as pd # Library useful to manipulate data tables (data frames) and visualize data\n", "import seaborn as sns; sns.set() # Note useful data visualization functions\n", "from scipy.stats import bernoulli# Errors made are like coin flips, the follow a Bernoulli distribution \n", "import matplotlib.pyplot as plt # Library to make graphs \n", "import matplotlib.pylab as pylab # Set font sizes as desired\n", "import matplotlib.ticker as ticker\n", "params = {'legend.fontsize': 'x-large',\n", " 'figure.figsize': (20, 10),\n", " 'axes.labelsize': 'xx-large',\n", " 'axes.titlesize':'xx-large',\n", " 'xtick.labelsize':'xx-large',\n", " 'ytick.labelsize':'xx-large'}\n", "pylab.rcParams.update(params)\n", "# Command to insert the graphs in line in the notebook:\n", "%matplotlib inline \n", "\n", "from platform import python_version\n", "print(python_version()) # Check which version of Python you are running" ] }, { "cell_type": "markdown", "id": "voluntary-exhibition", "metadata": {}, "source": [ "## 0. Load and visualize data\n", "To get some concrete real example, we weighted 10 apples and 10 bananas (in grams). We also measured the length of the 10 bananas (in cm). Note that for a given index, there is no relation between the apple weight and the banana weight, but the banana weight and length are measured for the same banana.\n", "\n", "We load the data as a Pandas dataframe." ] }, { "cell_type": "code", "execution_count": null, "id": "regular-arizona", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"AppleBanana.csv\")" ] }, { "cell_type": "markdown", "id": "dressed-powder", "metadata": {}, "source": [ "We use a nice feature of Pandas that allows us to visualize tables as a heat map. It appears immediately visually that AppleWeight and BananaWeight are not correlated (as expected), but BananaWeight and BananaLength seem to be correlated. Another observation is that there are two clusters of 5 apples, the first one is be composed of lighter apples than the second one." ] }, { "cell_type": "code", "execution_count": null, "id": "monthly-switzerland", "metadata": {}, "outputs": [], "source": [ "df.style.background_gradient(cmap='Blues')" ] }, { "cell_type": "markdown", "id": "ecological-focus", "metadata": {}, "source": [ "If we use the function describe() of Pandas, we get a bunch of statistics, le us see whether we understand what those are." ] }, { "cell_type": "code", "execution_count": null, "id": "subtle-reference", "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "foster-specialist", "metadata": {}, "source": [ "## 1. Mean, variance, and standard deviation\n", "We are going the use the banana weights to illustrate the notions of Mean (moyenne), variance, and standard deviation (écart type).\n", "\n", "First comes first: **we VISUALIZE the weights of the bananas on a line.**" ] }, { "cell_type": "code", "execution_count": null, "id": "romance-findings", "metadata": {}, "outputs": [], "source": [ "def show_series(df, series='BananaWeight'):\n", " SortedBananaWeight = df[series].sort_values()\n", " fig = plt.figure(figsize=(20, 1))\n", " plt.plot(SortedBananaWeight, np.ones(len(SortedBananaWeight)), 'go')\n", " plt.xlabel('%s' % series)\n", " ax = plt.gca()\n", " ax.axes.yaxis.set_visible(False)\n", " tick_spacing = 1\n", " ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))\n", "show_series(df)" ] }, { "cell_type": "markdown", "id": "private-washington", "metadata": {}, "source": [ "### Mean:\n", "\n", "
\n", "\n", "MEAN: Sum of values divided by number of values. \n", "\n", "
\n", "\n", "In mathematical notations, if the weights of the bananas are denoted:\n", "$x_1 = 149, x_2 = 154, \\cdots, x_n = 151$, the mean will be:\n", "\n", "\n", "\n", "$\\mu = \\frac{1}{10} (x_1 + x_2 + \\cdots + x_n) ~~= ~~\\frac{1}{n}\\sum_{i=1}^n x_i$\n", " \n", "\n", "\n", "that is:\n", "\n", "$\\mu = \\frac{1}{10} (149 + 154 + \\cdots + 151)$\n", "\n", "Let us check that and verify the we obtain the same value as provided by df.describe()." ] }, { "cell_type": "code", "execution_count": null, "id": "arctic-punishment", "metadata": {}, "outputs": [], "source": [ "mu = (1/10) * df['BananaWeight'].sum()\n", "assert(mu== df['BananaWeight'].mean())\n", "mu" ] }, { "cell_type": "markdown", "id": "expanded-footage", "metadata": {}, "source": [ "We can now place the mean as a red vertical bar in the plot we made previously." ] }, { "cell_type": "code", "execution_count": null, "id": "coral-smith", "metadata": {}, "outputs": [], "source": [ "show_series(df)\n", "plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red')\n", "plt.title(\"Mean = red vertical bar\")" ] }, { "cell_type": "markdown", "id": "solved-marriage", "metadata": {}, "source": [ "### Range, variance and standard deviation\n", "We now need a measure that summarizes the dispersion of points (voir **[indicateurs de dispersion](https://fr.wikipedia.org/wiki/Indicateur_de_dispersion)**). We could use the **range** (en français: étendue). The range is define as the difference between the maximum (largest) and the minimum (smallest) values:\n", "\n", "rng$= \\max_i(x_i) - \\min_i(x_i) = 166-142 = 24$" ] }, { "cell_type": "code", "execution_count": null, "id": "convertible-payment", "metadata": {}, "outputs": [], "source": [ "rng = df['BananaWeight'].max()-df['BananaWeight'].min()\n", "rng" ] }, { "cell_type": "markdown", "id": "southeast-wireless", "metadata": {}, "source": [ "Another possibility would be to compute the **[mean absolute deviation](https://en.wikipedia.org/wiki/Average_absolute_deviation)** (MAD) (en français: écart moyen) defined as:\n", "\n", "mad$= \\frac{1}{n} \\sum_{i=1}^n |x_i - \\mu|$\n", "\n", "that is the average of the distances between the $b_i$ values and the mean $\\mu$. Notice that if you omitted the absolute value, you would get: $\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu) = \\frac{1}{n} \\sum_{i=1}^n x_i - \\frac{1}{n} n \\mu = \\mu - \\mu = 0$, which is not very useful." ] }, { "cell_type": "code", "execution_count": null, "id": "stainless-aluminum", "metadata": {}, "outputs": [], "source": [ "mad = df['BananaWeight'].mad()\n", "mad" ] }, { "cell_type": "markdown", "id": "prerequisite-glucose", "metadata": {}, "source": [ "But more commonly, people use the **[standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)** (std ou stdev) (en français: écart type), which is the square root of the **[variance](https://en.wikipedia.org/wiki/Variance)** (en français: variance). The variance is defined as:\n", "\n", "
\n", " \n", "VARIANCE: Mean Square Distance to the Mean.\n", " \n", "
\n", "\n", "
\n", " \n", "VARIANCE (Français): La moyenne des carrés des écarts à la moyenne.\n", " \n", "
\n", "\n", "\n", "\n", "var$~ = \\sigma^2 = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2$.\n", " \n", "\n", "\n", "So the only difference between mad$_i$ and var$_b$ is that we SQUARE the differences $(x_i - \\mu)$ instead of taking their absolute values. Then to get the standard deviation, on just needs to take the square root:\n", "\n", "
\n", " \n", "STANDARD DEVIATION: Root Mean Square Distance to the Mean.\n", "\n", "
\n", "\n", "
\n", " \n", "ECART TYPE (Français): La racine carrée de la moyenne des carrés des écarts à la moyenne.\n", " \n", "
\n", "\n", "\n", " \n", "std$~ = \\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2}$.\n", " \n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "extra-tourist", "metadata": {}, "outputs": [], "source": [ "var = df['BananaWeight'].var()\n", "var" ] }, { "cell_type": "code", "execution_count": null, "id": "spread-linux", "metadata": {}, "outputs": [], "source": [ "std = df['BananaWeight'].std()\n", "std" ] }, { "cell_type": "code", "execution_count": null, "id": "korean-korea", "metadata": {}, "outputs": [], "source": [ "assert(std==np.sqrt(var))" ] }, { "cell_type": "markdown", "id": "speaking-decline", "metadata": {}, "source": [ "Let us plot now as dashed red vertical bars $(\\mu - \\sigma)$ and $(\\mu + \\sigma)$:" ] }, { "cell_type": "code", "execution_count": null, "id": "partial-explanation", "metadata": {}, "outputs": [], "source": [ "def show_mean_std(df, series = 'BananaWeight', two_sigma=False):\n", " dfb = df[series]\n", " mu = dfb.mean()\n", " std = dfb.std()\n", " plt.axvline(mu, alpha = 1, ymax = 1, linestyle = '-', color='red')\n", " plt.axvline(mu-std, alpha = 1, ymax = 1, linestyle = '--', color='red')\n", " plt.axvline(mu+std, alpha = 1, ymax = 1, linestyle = '--', color='red')\n", " if two_sigma:\n", " plt.axvline(mu-2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue')\n", " plt.axvline(mu+2*std, alpha = 1, ymax = 1, linestyle = '--', color='blue')\n", " \n", "show_series(df)\n", "show_mean_std(df)\n", "plt.title(\"(mean + std) and (mean - std) = red dashed vertical bars\")" ] }, { "cell_type": "markdown", "id": "proprietary-adelaide", "metadata": {}, "source": [ "## 2. Median, quartiles, percentiles (or quantiles)\n", "So far we have explained in df.describe() mean and std. We also found that min and max are the smallest and largest values (respectively) and that they are used to compute the range rng=max-min.\n", "\n", "Next we need to explain the lines marqued as 25\\%, 50\\%, and 75\\%. These are called **[percentiles](https://en.wikipedia.org/wiki/Percentile)**. \n", "\n", "
\n", " \n", "$P^{th}$ PERCENTILE: Threshold below which P% of the values are found.\n", "\n", "
\n", "\n", "Percentiles are expressed in percent (beween 0 and 100) while quantiles are expressed in probability values (between 0 and 1); they are otherwise the same notion:\n", "\n", "
\n", " \n", "$P^{th}$ QUANTILE: Threshold below which a fraction P of the values are found.\n", "\n", "
\n", "\n", "There are some special cases pf percentiles that have particular names:\n", "
\n", " \n", "LOWER QUARTILE: $25^{th}$ percentile.
\n", "MEDIAN = MIDDLE QUARTILE: $50^{th}$ percentile.
\n", "UPPER QUARTILE: $75^{th}$ percentile.
\n", "\n", "
\n", "\n", "The **median** splits the examples in two halves with the same number of examples on either side. The **quartiles** split the examples in 4 equal parts." ] }, { "cell_type": "code", "execution_count": null, "id": "attempted-advancement", "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "blind-store", "metadata": {}, "source": [ "Let us plot now:\n", "* as a solid green vertical bars the median,\n", "* as dashed green vertical bars the upper and lower quartiles,\n", "* as dashed black vertical bars the $5^{th}$ and $95^{th}$ percentiles." ] }, { "cell_type": "code", "execution_count": null, "id": "shaped-fortune", "metadata": {}, "outputs": [], "source": [ "def show_percentiles(df, series = 'BananaWeight', tail=0.10):\n", " dfb = df[series]\n", " # Calculate percentiles\n", " quant_L, quant_25, quant_50, quant_75, quant_U = dfb.quantile(tail), dfb.quantile(0.25), dfb.quantile(0.5), dfb.quantile(0.75), dfb.quantile(1-tail)\n", " print('QL=%5.2f, Q25=%5.2f, Q50=%5.2f, Q75=%5.2f, QU=%5.2f,' % (quant_L, quant_25, quant_50, quant_75, quant_U))\n", " \n", " # Plot the green lines\n", " plt.axvline(quant_25, alpha = 1, ymax = 1, linestyle = '--', color='green')\n", " plt.axvline(quant_50, alpha = 1, ymax = 1, linestyle = '-', color='green')\n", " plt.axvline(quant_75, alpha = 1, ymax = 1, linestyle = '--', color='green')\n", "\n", " # Plot the black lines\n", " plt.axvline(quant_L, alpha = 1, ymax = 1, linestyle = '--', color='black')\n", " plt.axvline(quant_U, alpha = 1, ymax = 1, linestyle = '--', color='black')\n", " \n", "show_series(df)\n", "show_percentiles(df)\n", "plt.title(\"Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.\")" ] }, { "cell_type": "markdown", "id": "basic-houston", "metadata": {}, "source": [ "First we can check that the potition of the 25, 50, and 75 percentiles in describe() coincide with those we found using quantile(). Second, we can be surprised of where the percentile threshold lines end up. If you read the [Pandas documentation on quantiles](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html), this is due to the fact that, by default, some linear interpolation is performed. You can play with the various options to see how the position of the vertical lines is affected." ] }, { "cell_type": "markdown", "id": "moral-belize", "metadata": {}, "source": [ "Let us now show mean and std and percentiles on top of each other." ] }, { "cell_type": "code", "execution_count": null, "id": "intermediate-graph", "metadata": {}, "outputs": [], "source": [ "show_series(df)\n", "show_mean_std(df)\n", "show_percentiles(df)\n", "plt.title(\"Mean, std and percentiles overlayed.\")" ] }, { "cell_type": "markdown", "id": "daily-recipient", "metadata": {}, "source": [ "## 3. Scatter plots, covariance, and correlation\n", "In a data frame, a column is called a \"series\". It can be interesting to plot one series vs. another. Let us plot BananaLength vs. BananaWeight, and compare it with AppleWeight vs. BananaWeight." ] }, { "cell_type": "code", "execution_count": null, "id": "aboriginal-payroll", "metadata": {}, "outputs": [], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))\n", "df.plot(x='BananaWeight', y='BananaLength', kind=\"scatter\", ax=ax1, color='g', s = 50)\n", "df.plot(x='BananaWeight', y='AppleWeight', kind=\"scatter\", ax=ax2, color='r', s = 50)" ] }, { "cell_type": "markdown", "id": "sealed-disabled", "metadata": {}, "source": [ "In the left plot, the points do no look super well aligned, still there seems a trend: the longer the banana, the heavier. In the right plot, as can be expected, there is no trend. **We can actually QUANTIFY that by measuring how the series co-vary (vary together):**" ] }, { "cell_type": "markdown", "id": "japanese-engineer", "metadata": {}, "source": [ "### Covariance\n", "Remember, we defined variance as the Mean Square Distance to the Mean, i.e.\n", "var$= \\sigma_x^2 = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu_x)^2$.\n", "\n", "By analogy with the variance, we define the **[co-variance](https://en.wikipedia.org/wiki/Covariance)** as:\n", "\n", "
\n", " \n", "COVARIANCE: Mean Joint Product of Distance to the Mean.\n", " \n", "
\n", "\n", "So given two series $\\{x_i\\}$ and $\\{y_i\\}$, the **covariance** is defined as:\n", "\n", "\n", "\n", "$\\sigma_{xy} = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu_x)(y_i - \\mu_y)$.\n", " \n", "\n", "\n", "We show below the **covariance matrix**, i.e. the matrix of all covariances of pairs of series in a data frame. On the diagonal, we have $\\sigma^2$, the variance. Off diagonal, we have the covariances $\\sigma_{xy}$. Check that the variance of BananaWeight is the same as found above." ] }, { "cell_type": "code", "execution_count": null, "id": "functional-columbus", "metadata": {}, "outputs": [], "source": [ "df.cov()" ] }, { "cell_type": "markdown", "id": "actual-meeting", "metadata": {}, "source": [ "
\n", " \n", "WARNING: The magnitude of the covariance coefficients are not easy to interpret, because they are not normalized, and hence **they depend on the magnitudes of the variables** (which can be expressed in different units, e.g. the length could be expressed in mm instead of cm and the weight in kg instead of g). Here we see for example that covariance(AppleWeight, BananaWeight) > covariance(BananaLength, BananaWeight), which is unexpected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related!\n", "\n", "
\n", "\n", "### Correlation\n", "\n", "The **[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)** is a normalized version of the covariance (**so it solves the above problem of scale**).\n", "\n", "
\n", " \n", "CORRELATION: Covariance normalized by product of standard deviations.\n", " \n", "
\n", "\n", "Mathematically:\n", "\n", "\n", " \n", "$\\rho_{xy} = \\frac{\\sigma_{xy}}{\\sigma_x \\sigma_y}$\n", "\n", "\n", "\n", "The magnitude of $\\rho_{xy}$ represents the strength of the linear relation between the two series. Its values are between -1 and 1. Zero means no correlation, 1 means perfect correlation (points aligned, positive slope) and -1 means perfect anti-correlation (points aligned, negative slope).\n", "\n", "The **correlation matrix** is the matrix of all correlations of pairs of series in a data frame. On the diagonal, we have $\\rho_{xx}$ values, which are always 1 (as expected from the definition of the correlation coefficient)." ] }, { "cell_type": "code", "execution_count": null, "id": "preliminary-swing", "metadata": {}, "outputs": [], "source": [ "df.corr()" ] }, { "cell_type": "markdown", "id": "understanding-leadership", "metadata": {}, "source": [ "We are now reassured: we see that correlation(AppleWeight, BananaWeight) < correlation(BananaLength, BananaWeight), which is what we expected since AppleWeight and BananaWeight are NOT related, whereas BananaLength and BananaWeight ARE related.\n", "\n", "To visualize the correlation matrix, we can tint the boxes and turn it into a heat map:" ] }, { "cell_type": "code", "execution_count": null, "id": "italian-devices", "metadata": {}, "outputs": [], "source": [ "df.corr().style.background_gradient(cmap='Blues')" ] }, { "cell_type": "markdown", "id": "smaller-escape", "metadata": {}, "source": [ "Unfortunately, as can be seen, the coloring of boxes for the same numerical value is different in the different columns. This is due to the fact that Pandas spans the color map between the smallest and the largest value in each column. To overcome this problem, we can visualize the correlation matrix with Seaborn:" ] }, { "cell_type": "code", "execution_count": null, "id": "measured-thought", "metadata": {}, "outputs": [], "source": [ "sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True)" ] }, { "cell_type": "markdown", "id": "above-information", "metadata": {}, "source": [ "## 4. Random variables, histograms, and PDF\n", "\n", "Consider a quantity like BananaWeight. We can consider the production of bananas as a random process. Given a very large number of banana, if we pick one at random, it will have a given weight. Each banana has a slightly different weight. We call BananaWeight a **random variable** (variable aléatoire). Random variables are often denoted with **capital letters**, such as $X$, and particular measurements or observations (called \"realizations\") are denoted with small letters, such as $x$.\n", "\n", "If we draw a very large number of bananas, the plots we made in section 1 (dots aligned on one axis), will become impossible to read (the dots will fall on one another). We will need another means of visualizing data. We are going to introduce **histograms**. \n", "\n", "
\n", "\n", "HISTOGRAM: To make a histogram, we split the x-axis in class intervals (called bins or buckets). We then make diagram consisting of rectangles whose area is proportional to the frequency of occurences in the interval and whose width is equal to the class interval.\n", "\n", "
\n", "\n", "This is a way of representing as a bar graph Proba($X \\in bin$). If the number of examples grows to infinity and, simultaneously, the width of the bins goes to zero, a histogram becomes a density $\\rho($X = x$)$." ] }, { "cell_type": "markdown", "id": "peaceful-economy", "metadata": {}, "source": [ "### Grades info 111\n", "As a first real life example, let us make histgrams of the grades of info 111. We replaced missing values by 0." ] }, { "cell_type": "code", "execution_count": null, "id": "another-jenny", "metadata": {}, "outputs": [], "source": [ "gdf = pd.read_csv(\"GradesInfo111.csv\")\n", "gdf.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "conceptual-cover", "metadata": {}, "outputs": [], "source": [ "gdf.style.background_gradient(cmap='Blues')" ] }, { "cell_type": "code", "execution_count": null, "id": "modified-locator", "metadata": {}, "outputs": [], "source": [ "col = gdf.columns\n", "sns.heatmap(gdf.sort_values(by=col).transpose())" ] }, { "cell_type": "markdown", "id": "regulation-generation", "metadata": {}, "source": [ "Let us check one of the series, that of the third column:" ] }, { "cell_type": "code", "execution_count": null, "id": "absent-tactics", "metadata": {}, "outputs": [], "source": [ "show_series(gdf, series=col)\n", "show_mean_std(gdf, series=col)\n", "show_percentiles(gdf, series=col)" ] }, { "cell_type": "markdown", "id": "distant-quantum", "metadata": {}, "source": [ "As can be seen, many points fall on top of one another. Hence it is difficult to visualize the distribution of grades in this way. Let us make a histogram." ] }, { "cell_type": "code", "execution_count": null, "id": "common-invention", "metadata": {}, "outputs": [], "source": [ "gdf[col].hist()\n", "plt.xlabel(col)\n", "plt.ylabel('Counts')\n", "show_mean_std(gdf, series = col)\n", "show_percentiles(gdf, series = col)\n", "plt.title(\"Mean = solid red. Mean +- std = dashed red. Median = solid green. Quartiles = dashed green. 10th and 90th percentiles = dashed black.\")" ] }, { "cell_type": "markdown", "id": "altered-punch", "metadata": {}, "source": [ "Now it is easier to see that there is a higher density of good grades, except for a cluster of people at very low grade (probably people who did not show up at the exam)." ] }, { "cell_type": "markdown", "id": "developing-queen", "metadata": {}, "source": [ "### Multiple histograms\n", "Pandas allows you to easily draw simultaneously the histograms of all series." ] }, { "cell_type": "code", "execution_count": null, "id": "separate-radical", "metadata": {}, "outputs": [], "source": [ "gdf.hist(figsize=(20, 20))" ] }, { "cell_type": "markdown", "id": "flexible-formula", "metadata": {}, "source": [ "## 5. Pair plots\n", "When we have several series, it is convenient to show simultaneously histograms and scatter plots. Be patient, this is SLOW. On the diagonal you will see the histograms for the various series. Off diagonal you will see the scatter plots of all pairs of series." ] }, { "cell_type": "code", "execution_count": null, "id": "aquatic-quantum", "metadata": {}, "outputs": [], "source": [ "sns.pairplot(gdf)" ] }, { "cell_type": "markdown", "id": "hispanic-sperm", "metadata": {}, "source": [ "As you can see, several series are very correlated (points nearly aligned). We can quantify that with the correlation matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "virgin-lending", "metadata": {}, "outputs": [], "source": [ "sns.heatmap(gdf.corr(), fmt='0.2f', annot=True, square=True)" ] }, { "cell_type": "markdown", "id": "steady-election", "metadata": {}, "source": [ "## 6. Normal Law and confidence intervals\n", "\n", "### Normal \"Law\" (\"Loi\" Normale)\n", "\n", "Let us go back to our banana example. We were too lazy to weigh a large number of bananas. So we decided to simulate drawing 1000 examples of BananaWeight using the Normal law, but **keeping the mean and standard deviation of our (small) real sample**. Formally, we say the BananaWeight is a random variable distributed according to the **[Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution)** or **Gaussian distribution**." ] }, { "cell_type": "code", "execution_count": null, "id": "future-house", "metadata": {}, "outputs": [], "source": [ "def draw_normal_bananas(df, series = 'BananaWeight', num=1000):\n", " # We draw num examples using the Normal law, using the mean and std of df[series]\n", " # By fixing the ramdom seed, we get the same sample every time we run the code\n", " np.random.seed(seed=1234) # scipy.stats uses numpy.random to generate its random numbers\n", " mu, sigma = df[series].mean(), df[series].std() # mean and standard deviation\n", " s = np.random.normal(mu, sigma, num) # draw 1000 bananas\n", " count, bins, ignored = plt.hist(s, 30, density=True) # make the histogram\n", " plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plot the Normal density\n", " np.exp( - (bins - mu)**2 / (2 * sigma**2) ),\n", " linewidth=2, color='r')\n", " plt.xlabel(series)\n", " plt.ylabel('Density')\n", " return pd.DataFrame(s, columns = [series])\n", " \n", "draw_normal_bananas(df)" ] }, { "cell_type": "markdown", "id": "blank-tunnel", "metadata": {}, "source": [ "The red line represents the **density** of bananas having a given weight. The blue bars represent the histogram obtained for 1000 bananas drawn randomly according the a Normal distribution with mean and standard deviation that are given by our emplirical sample of 10 bananas, used before.\n", "\n", "The height of the bars is proportional to the number of bananas that fall in the interval of weight defined by the width of the bar." ] }, { "cell_type": "markdown", "id": "paperback-statistics", "metadata": {}, "source": [ "### Mean, median, stdev, quartiles, and other quantiles, or percentiles\n", "As before, we can plot as vertical lines the mean variance the mean, median, stdev, quartiles, and other quantiles, or percentiles. But we chose different quantiles for the black dashed lines. The first one uses P=0.025=2.5% and the last one P=0.975=97.5%. In this way, we get 95% of the probability mass between the two black dashed lines." ] }, { "cell_type": "code", "execution_count": null, "id": "close-washington", "metadata": {}, "outputs": [], "source": [ "df_normal = draw_normal_bananas(df)\n", "show_mean_std(df_normal, two_sigma=True)\n", "show_percentiles(df_normal, tail=0.025)" ] }, { "cell_type": "markdown", "id": "different-airline", "metadata": {}, "source": [ "For the **[Normal distribution](https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2)**:\n", "* The mean and the median coincide.\n", "* We have beween $\\mu - \\sigma$ and $\\mu - \\sigma$ (red dashed lines) 68% of the probability mass. The is more than between the quartiles (green dashed lines): 50% of the probability mass, by definition.\n", "* We have beween $\\mu - 2\\sigma$ and $\\mu - 2\\sigma$ (blue dashed lines) 95% of the probability mass (this is verified by showing as black dashed lines the 2.5 percentile).\n", "* The Normal distribution is completely characterized by only two parameters: $\\mu$ and $\\sigma$ (sufficient statistics)." ] }, { "cell_type": "markdown", "id": "fuzzy-perth", "metadata": {}, "source": [ "### Confidence intervals\n", "If we want to give a an idea of a typical BananaWeight, we can just provide the mean (point estimation), or we can give an interval in which it is probable to find the weight of a typical banana.\n", "\n", "For the Normal distribution, we have:\n", "* Proba$(\\mu-\\sigma\n", " \n", "ERROR BAR: There are multiple definitions of error bars. \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " When we writeWe mean Measurement =$\\mu \\pm \\sigma$A 68% CI, for data Normally distributed. Proba$(\\mu-\\sigma\n", "
Measurement = $\\mu \\pm 2 \\sigma$A 95% CI, for data Normally distributed. Proba$(\\mu-2\\sigma\n", " Measurement =$[L, U]$, 95% CI$L$is the lower bound and$U$the upper bound of the 95% CI \n", "\n", "" ] }, { "cell_type": "markdown", "id": "placed-flashing", "metadata": {}, "source": [ "To get values for the three series in our fruit example, while **respecting the covariance between the series**, we can draw examples using the Normal multivariate distribution, using the covariance matrix of our (small) real data sample." ] }, { "cell_type": "code", "execution_count": null, "id": "settled-photograph", "metadata": {}, "outputs": [], "source": [ "s = np.random.multivariate_normal(df.mean(), df.cov(), 1000)\n", "df_multi = pd.DataFrame(s, columns = ['AppleWeight', 'BananaWeight', 'BananaLength'])\n", "sns.pairplot(df_multi)" ] }, { "cell_type": "markdown", "id": "exterior-toner", "metadata": {}, "source": [ "We can then check that the correlation matrix of the articial data resembles that of the real data." ] }, { "cell_type": "code", "execution_count": null, "id": "liquid-massachusetts", "metadata": {}, "outputs": [], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))\n", "sns.heatmap(df_multi.corr(), fmt='0.2f', annot=True, square=True, ax=ax1)\n", "ax1.set_title('Artificial data')\n", "sns.heatmap(df.corr(), fmt='0.2f', annot=True, square=True, ax=ax2)\n", "ax2.set_title('Real data')" ] }, { "cell_type": "markdown", "id": "ranging-incidence", "metadata": {}, "source": [ "## 6. Variance of the mean, standard error, error bars, numerical precision" ] }, { "cell_type": "markdown", "id": "automatic-needle", "metadata": {}, "source": [ "Often, what we are interested in is not the distribution of a quantity like BananaWeight, but we want to estimate as precisely as possible the true MEAN of the quantity (its mathematical expectation). The larger the sample size n, the more precisely we will estimate the mean. In fact, the **variance of the mean is inversely proportional to n**.\n", "\n", "### Variance of the mean and standard error\n", "We call$\\mu$the true mean and$\\hat{\\mu} = (1/n) (\\sum_{i=1}^n X_i)$the estimator of the empirical mean (computed with only n examples).\n", "\n", " \n", " \n", "**WARNING:** We do not always make a distinction between:\n", "* the \"true\" mean of a random variable X$\\mu$, also called mathematical expectation, (which is obtained by averaging an infinite number of examples, drawn according to the distribution of X), and\n", "* the empirical mean$\\hat{\\mu}$(which is obtained by averaging an FINITE number of examples).\n", "\n", "unless it is necessary because we manipulate both at the same time.\n", "\n", " \n", "Note that$\\hat{\\mu}$is a **random variable** and as such it has a variance. We obtain different values of$\\hat{\\mu}$every time we draw randomly (and independently) n values of X over which we average.\n", " \n", " \n", "\n", " \n", " \n", "VARIANCE OF THE MEAN: var$(\\hat{\\mu}) = \\sigma_{\\mu}^2 = \\sigma^2/n$\n", "\n", " \n", "\n", " \n", " \n", "**PROOF:** You can skips this proof, but if you are interested, read on, it is not that hard.\n", "\n", "Consider n random variables (RV) distributed identically, but independent of one another:$X_1, X_2 \\cdots X_n$, of mean$\\mu$and of variance$\\sigma$. The variance of the sum will be the sum of the variances (search for \"Sum of uncorrelated variables\" in the [Wikipedia page on variance](https://en.wikipedia.org/wiki/Variance)): \n", "\n", "$var(\\sum_{i=1}^n X_i) = \\sum_{i=1}^n var(X_i) = n \\sigma^2$. \n", "\n", "The empirical mean is \n", "\n", "$\\hat{\\mu} = (1/n) (\\sum_{i=1}^n X_i)$.\n", "\n", "We put a hat on top of the empirical mean to distiguish it from the true mean$\\mu$. We also know that when a RV is scaled by a constant$a$, its variance is multiplied by$a^2$(search for \"Basic properties\" in the [Wikipedia page on variance](https://en.wikipedia.org/wiki/Variance)). So:\n", "\n", "var$(\\hat{\\mu})=(1/n)^2 var(\\sum_{i=1}^n X_i) = (1/n)^2 n \\sigma^2 = \\sigma^2/n. $\n", "\n", "Therefore the **variance of the mean** is: var$(\\hat{\\mu}) = \\sigma_{\\mu}^2 = \\sigma^2/n$\n", " \n", " \n", "\n", "\n", "\n", "The standard deviation of the mean (also known as **standard error**) is obtained by taking the square root of the variance of the mean:\n", "\n", " \n", " \n", "STANDARD ERROR:$\\sigma_{\\mu} = \\sigma/\\sqrt{n}$\n", " \n", " \n", " \n", "\n", "The **[law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers)**, stipulates that as$n$goes to infinity the empirical mean$\\hat{\\mu}$converges to the \"mathematical expectation\"$\\mu$, and the variance of the mean goes to zero. Taking as example the Normal law and the 95% confidence interval we saw above, we easily see that:\n", " \n", "Proba$(\\mu - \\sigma_{\\mu} < \\hat{\\mu} < \\mu - \\sigma_{\\mu}) = 0.95$\n", "\n", "Proba$(\\mu - \\sigma/\\sqrt{n} < \\hat{\\mu} < \\mu - \\sigma/\\sqrt{n}) = 0.95$\n", "\n", "Hence$\\hat{\\mu}$converges to$\\mu$.\n", " \n", " \n", "\n", "**VERIFICATION:**\n", " We now proceed to a simple empirical verification. We draw$b$samples of$n$values according to the Normal distribution (each one is called a \"batch\" or \"block\"). We compute the means$\\mu_i$of each of these$b$batches,$b = 1 \\cdots n$. Then we compute the empirical variance$\\sigma_{\\mu}$of the means$\\mu_i$, over the$b$batches, that is$\\hat{\\sigma}_{\\mu}^2 = (1/b)\\sum_{i=1}^b (\\mu_i - \\hat{\\mu})^2$, where$\\hat{\\mu} = (1/b)\\sum_{i=1}^b \\mu_i$. We added a hat on$\\sigma_{\\mu}^2$and$\\mu$to remind ourselves that this is only an approximation evaluated on$b$batches, but if we make$b$large it should be good. If the theory is correct, then$\\hat{\\sigma}_{\\mu}^2$should be inversely proportional to$n$. This is easy to visualize on a log-log plot since$\\log \\sigma_{\\mu} = \\log \\sigma - \\log n. Hence, the points should be aligned and the line should have a slope -1 in log-log coordinates.\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "small-integration", "metadata": {}, "outputs": [], "source": [ "# For concreteness, we choose as mean and variance that of our empirical sample of banana weights\n", "mu, sigma = df['BananaWeight'].mean(), df['BananaWeight'].std() # mean and standard deviation\n", "# We make a note of the actual sample size\n", "N = len(df['BananaWeight'])\n", "# We make a note of the variance\n", "V = sigma**2\n", "# We select some sample sizes, growing exponentially, to verify the law of large numbers\n", "n = [1, 10,100,1000,10000]\n", "# We select a large batch size\n", "b = 10000\n", "v = []\n", "for num in n:\n", " mu_vec = np.zeros((b,1))\n", " for i in range(b):\n", " # We draw a vector of size num containing fake banana weights\n", " x = np.random.normal(mu, sigma, num) \n", " # We evaluate the mean for this number of bananas (num)\n", " mu_vec[i,0] = x.mean() # We repeat this b times and fill a vector of b mu values\n", " # We compute for each num value the variance of the mu, over the b batches\n", " v = np.append(v, mu_vec.var()) " ] }, { "cell_type": "code", "execution_count": null, "id": "vanilla-discretion", "metadata": {}, "outputs": [], "source": [ "def log_log_plt(n, v, N, V):\n", " # Plot a list of variance values as a function of a list of n values\n", " # Indicate in blue the actual value of N and the variance of mean predicted by the theory\n", " # Assume n=1\n", " plt.plot(n, v)\n", " plt.plot(n, v, 'ro')\n", " plt.plot(N, V/N, 'bo')\n", " ax = plt.gca()\n", " ax.set_xscale('log')\n", " ax.set_yscale('log')\n", " plt.xlabel('Number examplesn$')\n", " plt.ylabel('Variance of mean$\\sigma_{\\mu}^2$')\n", " plt.text(n, v, '$\\sigma^2_{\\mu}=\\sigma^2$', fontsize=18, color='r') # coordinate at origin\n", " plt.text(N, V/N, '$\\sigma^2_{\\mu}=\\sigma^2/N$', fontsize=18, color='b') # coordinate of original N value\n", " plt.title('Variation of$\\sigma^2_{\\mu}$as a function of$n$, for$\\sigma^2=%.0f and N=%d' % (V, N))\n", " \n", "log_log_plt(n, v, N, V)" ] }, { "cell_type": "markdown", "id": "unauthorized-creativity", "metadata": {}, "source": [ "We verify that the points are well aligned, with slope (-1) in the log-log scale, in accordance with: \n", "\n", " \n", "\n", "LAW OF LARGE NUMBERS:\\log \\sigma_{\\mu}^2= \\log \\sigma^2 - \\log n$. \n", "\n", " \n", "\n", "The intercept (coordonnée à l'origine) is$\\log \\sigma^2$. Hence is suffices for us to know (an empirical estimate of)$\\sigma^2$to draw this entire curve, once we know the \"theory\". In what follows we call$\\sigma$the standard deviation estimated with our empirical sample and we assume that$\\sigma_{\\mu}$is computed with$\\sigma_{\\mu} = \\sigma/\\sqrt{n}$, rather than being empirically estimated with$b$samples of the same size$n$(which would not make sense, because if we had$bn$samples, we would use all of them to get a more precise estimation of the mean.\n", "\n", "The base of the log does not matter. Here for convenience we use log$_{10}$." ] }, { "cell_type": "markdown", "id": "killing-breast", "metadata": {}, "source": [ "### Numerical precision and sample size\n", "\n", "The question now is: **How many examples do you need to get a given numerical precision**?\n", "\n", "\n", "There are several ways of defining [numerical precision](http://web.mnstate.edu/marasing/CHEM380/Documents/Statistics%20I%202017.pdf).\n", "In this lecture and for the next few weeks, we will define numerical precision as:\n", "\n", " \n", " \n", "NUMERICAL PRECISION:$\\nu = \\log_{10}$snr.\n", " \n", " \n", "where snr is the signal to noise ratio:\n", "\n", " \n", " \n", "SIGNAL TO NOISE RATIO: snr$ = \\frac{\\hat{\\mu}}{\\sigma_{\\mu}}$.\n", "\n", " \n", "\n", " \n", " \n", "**WARNING:** By doing that, we consider only the **statistical error** also called **random error** and ignore possible **systematic errors** (we will see that in the chapter on bias) and we do not use the [standard comupter science definition of numerical precision](https://en.wikipedia.org/wiki/Precision_(computer_science)). Our notion of precision is that used in **[signal processing](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)** and **[statistics](http://www.dspguide.com/ch2/7.htm)**. It is related to the standard deviation$\\sigma$, the **signal to noise ratio**$snr= \\mu / \\sigma$or its inverse the **coefficient of variation**$cv = \\sigma / \\mu$(or its square, the **efficiency**$eff = (\\sigma / \\mu)^2$). Hence:$\\nu = \\log_{10} snr = - \\log_{10} cv = - (1/2) \\log_{10} eff$.\n", "\n", " \n", " \n", "Now, if we use ONE-sigma bars (68\\% condidence intervals for the Normal distribution) we will report results as:\n", "\n", "Measurement =$\\hat{\\mu} \\pm \\sigma_{\\mu}$\n", "\n", "For example:\n", "\n", "Measurement =$1.00 \\pm 0.01$\n", "\n", "In this case we will have$\\hat{\\mu}/\\sigma_{\\mu}=100$, thus$\\nu=2$and we see that$\\nu=2$corresponds to the number of decimal places. If we have:\n", "\n", "Measurement =$123.000 \\pm 0.123$\n", "\n", "then,$\\hat{\\mu}/\\sigma_{\\mu}=1000$, thus$\\nu=3$, and we will call that our numerical precision. We will keep only ONE trailing non zero number in the error bar and write the result as:\n", "\n", "Measurement =$123.0 \\pm 0.1$\n", "\n", " \n", " \n", " SIGNIFICANT DIGITS: Significant digits of a calculated result$\\hat{\\mu}$are determined by the\n", "**largest non zero digit of the uncertainty**$\\sigma_{\\mu}$, called \"error digit\". We only keep a single error digit in the uncertainty when reporting results. The number of significant digits should be identical to the precision$\\nu$(up to rounding errors) and will be the **number of non zero digits above the \"error digit\"**.\n", " \n", " \n", "\n", "Yet another example:\n", "\n", "Measurement =$0.00318123 \\pm 0.0005$\n", "\n", "then,$\\hat{\\mu}/\\sigma_{\\mu}=6.36$, thus$\\nu=0.8$, we will write the result as:\n", "\n", "Measurement =$0.0032 \\pm 0.0005$\n", "\n", "We sse that we have one non zero digit in$\\hat{\\mu}$above the error digit, hence one significant digit, consistent with the fact that round($\\nu$)=1. Note that we rounded the last digit of$\\hat{\\mu}$.\n", "\n", " \n", " \n", "FIXING THE NUMERICAL PRECISION: If we now want to IMPOSE a given numerical precision$\\nu = log_{10}$snr, we can express$\\sigma_{\\mu}$in two different ways:\n", "\n", "$\\sigma_{\\mu} = \\sigma~/~\\sqrt{n}$\n", "\n", "$\\sigma_{\\mu} = \\hat{\\mu}~/$snr$= \\hat{\\mu}~/~10^{\\nu}$\n", " \n", " \n", "\n", "Solving for n, we get:\n", "\n", " \n", " \n", "SAMPLE SIZE TO GET A GIVEN snr:$~~n = \\frac{\\sigma^2}{\\hat{\\mu}^2}$snr$^2$.\n", " \n", " \n", " \n", "Or equivalently:\n", " \n", " \n", " \n", "SAMPLE SIZE TO GET A NUMERICAL PRECISION snr:$~~n = \\frac{\\sigma^2}{\\hat{\\mu}^2} 10^{2\\nu}$.\n", " \n", " \n", "\n", "So given empirical estimates of the mean and the variance of the quantity of interest, we can compute the number of examples$n$we need to achieve a given precision$\\nu$. Unfortunately, **$n$increases exponentially with$\\nu$.**\n", "Note that technically$n$should be rounded to get an integer value." ] }, { "cell_type": "markdown", "id": "specialized-berkeley", "metadata": {}, "source": [ " \n", " \n", "**BERNOULLI TRIALS:** This formula looks daunting, but we can simplifying in one very useful particular case: If the quantity of interest$X$is just coin flips (e.g. 1 for an error and 0 for a correct answer), with a probability$p$of heads and$(1-p)$of tail. In that case,$X$is known to be distributed according to a **[Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)** for which$\\mu=p$and$\\sigma=p(1-p)$. Using$\\nu=1$(one significant digit) and$\\hat{\\mu}$as our estimator of$p$, if$p \\ll 1$we get the simplified formula: \n", "\n", "$n p \\simeq 100$\n", " \n", "which is easy to remember. It means that to get \"good\" error bars, we need to see at least 100 errors.\n", "\n", " \n", "\n", " \n", "\n", "**GRAPHICAL REPRESENTATION:**\n", "Let is go back to our graph in log-log coordinates and plot both: \n", " \n", "$\\sigma^2_{\\mu} = \\sigma^2 / n$\n", " \n", "$\\sigma^2_{\\mu} = \\hat{\\mu}^2/10^{2\\nu}$. \n", " \n", " \n", "The intersection of the two will give us the desired numerical precision. In log coordinates, we just need to find the intersection of two lines:\n", " \n", "$\\log_{10} \\sigma_{\\mu}^2= \\log_{10} \\sigma^2 - \\log_{10} n$\n", " \n", "$\\log_{10} \\sigma_{\\mu}^2= \\log_{10} \\hat{\\mu}^2 - 2 \\nu $. \n", " \n", "As a numerical application, let us use our original BananaWeight example.\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "soviet-timing", "metadata": {}, "outputs": [], "source": [ "# (1) This part is similar as before, except that we only give the 1st and the last point of the line\n", "# the first one corrsponding to (n=1, sigma^2_mu=sigma^2) and the last (n=n_max, sigma^2_mu=sigma^2/n_max)\n", "# We do not compute sigma^2_mu empirically, we just apply the formula sigma^2/n\n", "# Compute the empirical mean\n", "mu = df['BananaWeight'].mean()\n", "# Compute the empirical variance\n", "V = df['BananaWeight'].var()\n", "# Find original value of n\n", "N = len(df['BananaWeight'])\n", "# Compute variance of the mean for the original sample number N\n", "var_n = V/N\n", "# Compute the original precision\n", "nu_orig = np.round(np.log10(mu/np.sqrt(var_n)))\n", "# Provide the original precision\n", "print('Using N = %d examples, mean(BananaWeight) = %d +- %d (nu = %d)' % (N, np.round(mu), np.round(np.sqrt(var_n)), nu_orig))\n", "# Compute variance of the mean for a vary large value of n (n_max)\n", "n_max = 10000\n", "var_n_max = V/n_max\n", "# Make the log-log plot\n", "log_log_plt([1, n_max], [V, var_n_max], N, V)\n", "\n", "# (2) This part is new. We want to impose a larger number of significant digits.\n", "# We choose a precision nu and compute the number of examples n_nu necessary to get that precision\n", "# Choose precision\n", "nu = 3 # Three significant digits\n", "# Get var_n_nu = mu^2 / 10^2nu\n", "var_new_n = (mu/10**nu)**2\n", "# Get new n value at intersection\n", "new_n = np.round((var/var_new_n))\n", "# Plot horizontal line at var_n_new\n", "plt.plot([1, new_n], [var_new_n, var_new_n])\n", "print(var_new_n, V/new_n, np.sqrt(var_new_n))\n", "print('Using N*= %d examples, mean(BananaWeight) = %5.1f +- %.1f (nu = %d)' % (new_n, mu, np.sqrt(var_new_n), nu))\n", "# Plot vertical bar and show tex for new N value N*\n", "plt.plot([new_n, new_n], [var_new_n, var_n_max], \"--\", color='black')\n", "plt.plot(new_n, var_new_n, 'ko')\n", "plt.text(new_n, var_new_n, '$\\sigma^2_{\\mu}=\\sigma^2/N*$', fontsize=18, color='k') # coordinate of new N value\n", "plt.text(N, var_new_n, '$\\sigma^2_{\\mu}=\\hat{\\mu}^2/10^{2\\\\nu}$', fontsize=18, color='k') # coordinate of new N value\n", "plt.text(new_n/1.5, var_n_max/4, '$N*$', fontsize=18, color='k') # Desired value of N*" ] }, { "cell_type": "markdown", "id": "significant-introduction", "metadata": {}, "source": [ " \n", "\n", "**NUMBER OF EXAMPLES AS A FUNCTION OF$\\nu$:** Taking to log$_{10}$of$~~n = \\frac{\\sigma^2}{\\hat{\\mu}^2} 10^{2\\nu}$we get:\n", " \n", "$\\log_{10}n = 2\\nu + b $,\n", " \n", "with$b = \\log_{10} \\sigma^2 - 2 \\log_{10} \\hat{\\mu}$. So, given empirical estimates of the mean and the variance of the quantity of interest$\\sigma$and$\\hat{\\mu}$(for a given number of examples$N$), we can easily make an abacus to determine the number of examples that would be needed to obtain any desired precision$\\nu$on the estimate of the mean.\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "australian-cutting", "metadata": {}, "outputs": [], "source": [ "# Compute the empirical mean\n", "mu = df['BananaWeight'].mean()\n", "# Compute the empirical variance\n", "V = df['BananaWeight'].var()\n", "# Compute b\n", "b = np.log10(V) - 2* np.log10(mu)\n", "# Choose first and last point\n", "min_nu = 0\n", "max_nu = 3\n", "min_logn = 2*min_nu +b\n", "max_logn = 2*max_nu +b\n", "min_n = 10**min_logn\n", "max_n = 10**max_logn\n", "# Plot n on a log scale\n", "plt.plot([min_nu, max_nu], [min_n, max_n], \"-\", color='black')\n", "ax = plt.gca()\n", "ax.set_yscale('log')\n", "plt.xlabel('Precision$\\\\nu$')\n", "plt.ylabel('Number examples$n$')\n", "# Indicate the precision of the original sample\n", "N = len(df['BananaWeight'])\n", "var_n = V/N\n", "nu_orig = np.log10(mu/np.sqrt(var_n))\n", "plt.plot(nu_orig, N, 'bo')" ] }, { "cell_type": "markdown", "id": "burning-airplane", "metadata": {}, "source": [ "Bottomline: the number of examples increases exponetially with the precision." ] }, { "cell_type": "markdown", "id": "static-psychology", "metadata": {}, "source": [ "### Scaling the number of examples with the confidence\n", "As indicated before, the size confidence interval depends on the degree of confidence we want to put into it. According to [Wikipedia](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule):\n", "\"In statistics, the **68–95–99.7 rule**, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a **normal distribution** with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within **one, two and three standard deviations of the mean**, respectively.\" In mathematical form, this means:\n", "\n", "Proba$( |X-\\mu| \\leq \\sigma ) \\simeq 0.6827\n", "\n", "Proba$( |X-\\mu| \\leq 2 \\sigma ) \\simeq 0.9545$\n", "\n", "Proba$( |X-\\mu| \\leq 3 \\sigma ) \\simeq 0.9973$\n", "\n", "Even though not all distributions of interest are \"normal\", many are nearly normal and it is usual, instead of reporting confidence intervals as:\n", "\n", "Proba$( |X-\\mu| \\leq \\epsilon ) \\simeq (1 - \\delta)$\n", "\n", "where $\\delta$ is a small value, one only considers for $\\epsilon$ multiple values of $\\sigma$: $1~\\sigma$, $2~\\sigma$, $3~\\sigma$, $\\dots k~\\sigma$, and one does not even bother reporting the exact corresponding value of $\\delta$.\n", "\n", "One question of interest though is how does the number of examples needed to ensure a given confidence interval grow with $\\delta$ (or more simply with the number $k$ in $\\epsilon=k~\\sigma$.\n", "\n", "This is easily settled. Earlier on, we defined the **signal to noise ratio** as snr$= \\mu / \\sigma_{\\mu}$. We indicated that this implicitly corresponds to a $1~\\sigma$ error bar: Measurement = $\\hat{\\mu} \\pm \\sigma_{\\mu}$. We now consider:\n", "\n", "Measurement = $\\hat{\\mu} \\pm \\epsilon_{\\mu} = \\hat{\\mu} \\pm k~\\sigma_{\\mu}$.\n", "\n", "We re-define (more properly) the signal to noise ratio as:\n", "\n", "snr$= \\mu / \\epsilon_{\\mu}$\n", "\n", "where $\\epsilon_{\\mu} = k~\\sigma_{\\mu}$.\n", "\n", "Thus $\\sigma_{\\mu}$ can be expressed as:\n", "\n", "$\\sigma_{\\mu} = \\mu / (k$snr).\n", "\n", "Applying the same defintion as before of numerical precision $\\nu = \\log_{10}$ snr, we also get:\n", "\n", "$\\sigma_{\\mu} = \\mu / (k 10^\\nu)$.\n", "\n", "And now, we can again apply the formula $\\sigma_{\\mu} = \\sigma / \\sqrt{n}$ and solve for $n$ to get:\n", "\n", "
\n", " \n", "SAMPLE SIZE TO GET A GIVEN PRECISION $\\nu$ and CONFIDENCE $k$: $~~~n = \\frac{\\sigma^2}{\\mu^2} .k^2. 10^{2\\nu}$ .\n", " \n", "
\n", "\n", "Bottomline: **$n$ grows only with the square of $k$, but it grows exponentially with $\\nu$**." ] }, { "cell_type": "markdown", "id": "collaborative-spelling", "metadata": {}, "source": [ "Let us pose:\n", "\n", "$n_{11}=\\sigma^2 / \\mu^2$\n", "\n", "We have the following table of growth of $n$ with the number of significant digits $\\nu=$ and the number of sigmas in the error bar $k$:\n", "\n", "| | $\\nu=0$ |$\\nu=1~digit$ | $\\nu=2~digits$ | $\\nu=3~digits$ |\n", "| :--- | :----: | :----: | :----: | ---: |\n", "| $\\epsilon= 1~\\sigma_{\\mu}$ | $~~~~~~n_{11}$ | $100~n_{11}$ | $10000~n_{11}$ | $10^6~n_{11}$\n", "| $\\epsilon= 2~\\sigma_{\\mu}$ | $~~~~4~n_{11}$| $400~n_{11}$ | $40000~n_{11}$ | $4~10^6~n_{11}$ |\n", "| $\\epsilon= 3~\\sigma_{\\mu}$ | $~~~~9~n_{11}$ | $900~n_{11}$ | $90000~n_{11}$ | $9~10^6~n_{11}$ \n", "\n", "We typically recommend to use:\n", "\n", "$n=100~\\sigma^2 / \\mu^2$\n", "\n", "which corresponds to only $1~\\sigma_{\\mu}$ confidence interval and $\\nu=1$ significant digit, which is the least we can do to get descent error bars. In the particular case of the Bernouilli distribution with parameter $p$, we have $\\mu = p$ and $\\sigma^2 = p(1-p)$. Thus if $p \\ll 1$, we find again:\n", "\n", "$n \\simeq 100 /p$" ] }, { "cell_type": "markdown", "id": "smooth-playing", "metadata": {}, "source": [ "### Bootstrap resampling\n", "In the previous we found a simple scaling law for the number of examples needed to get a good precision of the mean. However, we made the assumption that the sample mean was (nearly) distributed with the Normal law. This is actually not so bad if the sample are drawn from the Bernoulli distribution, because the sum of Bernouilli random variables is distributed according to the Binomial law, which can be approximated with the Normal law for (relatively) large values of $n$. But consider again the distribution of final exam grades:" ] }, { "cell_type": "code", "execution_count": null, "id": "motivated-capital", "metadata": {}, "outputs": [], "source": [ "gdf[col].hist(density=True)\n", "plt.xlabel(col)\n", "plt.ylabel('Density')" ] }, { "cell_type": "markdown", "id": "taken-timer", "metadata": {}, "source": [ "It has a \"wierd\" distribution. How could we know whether the distribution of its mean values (for different students, but similarly distributed) would be \"Normal\"? it is probably not normal. The idea othe the bootstrap is to sample $n$ times (with replacement) examples from the $n$ empirical values we have to obtain a \"new\" bootstrap sample, whose distribution is similar to that of our original sample. We do that $b$ to get $b$ bootstrap samples. Each of them has $n$ examples (with some repetitions). For each sample of size $n$, we can calculate its mean and see how it is distributed. Let's give it a try!" ] }, { "cell_type": "code", "execution_count": null, "id": "suitable-labor", "metadata": {}, "outputs": [], "source": [ "def mean_bootstrap_samples(df, b=100000):\n", " # take a df series and sample n values with replacement b times\n", " # return a vector of mean values over the b bootstraps\n", " n = len(df)\n", " DF = np.random.choice(df, size =(n, b))\n", " MU = np.mean(DF, axis=0)\n", " return MU" ] }, { "cell_type": "code", "execution_count": null, "id": "protective-shuttle", "metadata": {}, "outputs": [], "source": [ "bs = mean_bootstrap_samples(gdf[col])" ] }, { "cell_type": "code", "execution_count": null, "id": "sexual-canada", "metadata": {}, "outputs": [], "source": [ "mu, sigma = bs.mean(), bs.std() # mean and standard deviation\n", "count, bins, ignored = plt.hist(bs, 30, density=True) # make the histogram\n", "plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * # plt the normal density\n", " np.exp( - (bins - mu)**2 / (2 * sigma**2) ),\n", " linewidth=2, color='r')\n", "plt.xlabel(\"Mean \" + col)\n", "plt.ylabel('Density')" ] }, { "cell_type": "markdown", "id": "accepted-irrigation", "metadata": {}, "source": [ "Morality: the density of the mean really looks like the Normal density." ] }, { "cell_type": "code", "execution_count": null, "id": "significant-shareware", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.10.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "source_map": [ 12, 25, 45, 52, 54, 58, 60, 64, 66, 73, 84, 109, 113, 117, 121, 128, 131, 139, 142, 184, 189, 194, 196, 200, 215, 247, 249, 256, 275, 279, 283, 288, 293, 297, 301, 325, 327, 357, 359, 365, 367, 371, 373, 389, 394, 399, 403, 406, 410, 414, 418, 425, 429, 434, 436, 441, 443, 447, 449, 457, 473, 479, 484, 488, 496, 534, 538, 542, 546, 552, 556, 628, 651, 669, 683, 774, 806, 846, 857, 882, 886, 934, 956, 961, 965, 969, 979, 983, 991, 995 ] }, "nbformat": 4, "nbformat_minor": 5 }