When we write | We mean
---|---
Measurement = $\mu \pm \sigma$ | A 68% CI, for Normally distributed data: Proba$(\mu-\sigma \leq X \leq \mu+\sigma) \simeq 0.68$
Measurement = $\mu \pm 2\sigma$ | A 95% CI, for Normally distributed data: Proba$(\mu-2\sigma \leq X \leq \mu+2\sigma) \simeq 0.95$
Measurement = $[L, U]$, 95% CI | $L$ is the lower bound and $U$ the upper bound of the 95% CI

**WARNING:** We do not always make a distinction between:
* the "true" mean $\mu$ of a random variable $X$, also called the mathematical expectation (which is obtained by averaging an infinite number of examples, drawn according to the distribution of $X$), and
* the empirical mean $\hat{\mu}$ (which is obtained by averaging a FINITE number of examples),

unless it is necessary because we manipulate both at the same time.
Note that $\hat{\mu}$ is a **random variable** and as such it has a variance. We obtain different values of $\hat{\mu}$ every time we draw randomly (and independently) n values of X over which we average.
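To see this concretely, here is a minimal sketch (with an arbitrary Normal distribution standing in for $X$): each independent draw of $n$ values yields a different value of $\hat{\mu}$.

```{code-cell} ipython3
import numpy as np
rng = np.random.default_rng(0)
# Assumed "true" mean and standard deviation of X (arbitrary values, for illustration only)
mu, sigma, n = 100.0, 10.0, 50
# Each batch of n values gives a different empirical mean mu_hat
mu_hats = [rng.normal(mu, sigma, n).mean() for _ in range(5)]
print(mu_hats)  # five different estimates, all scattered around mu
```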

**PROOF:** You can skip this proof, but if you are interested, read on; it is not that hard.
Consider $n$ random variables (RV), identically distributed but independent of one another: $X_1, X_2 \cdots X_n$, of mean $\mu$ and of variance $\sigma^2$. The variance of the sum is the sum of the variances (search for "Sum of uncorrelated variables" in the [Wikipedia page on variance](https://en.wikipedia.org/wiki/Variance)):
$var(\sum_{i=1}^n X_i) = \sum_{i=1}^n var(X_i) = n \sigma^2$.
The empirical mean is
$\hat{\mu} = (1/n) (\sum_{i=1}^n X_i)$.
We put a hat on top of the empirical mean to distinguish it from the true mean $\mu$. We also know that when a RV is scaled by a constant $a$, its variance is multiplied by $a^2$ (search for "Basic properties" in the [Wikipedia page on variance](https://en.wikipedia.org/wiki/Variance)). So:
var$(\hat{\mu})=(1/n)^2 var(\sum_{i=1}^n X_i) = (1/n)^2 n \sigma^2 = \sigma^2/n. $
Therefore the **variance of the mean** is: var$(\hat{\mu}) = \sigma_{\mu}^2 = \sigma^2/n$

The standard deviation of the mean (also known as the **standard error**) is obtained by taking the square root of the variance of the mean: $\sigma_{\mu} = \sigma / \sqrt{n}$.

**VERIFICATION:**
We now proceed to a simple empirical verification. We draw $b$ samples of $n$ values according to the Normal distribution (each one is called a "batch" or "block"). We compute the means $\mu_i$ of each of these $b$ batches, $i = 1 \cdots b$. Then we compute the empirical variance of the means $\mu_i$ over the $b$ batches, that is $\hat{\sigma}_{\mu}^2 = (1/b)\sum_{i=1}^b (\mu_i - \hat{\mu})^2$, where $\hat{\mu} = (1/b)\sum_{i=1}^b \mu_i$. We added a hat on $\sigma_{\mu}^2$ and $\mu$ to remind ourselves that these are only approximations evaluated on $b$ batches, but if we make $b$ large they should be good. If the theory is correct, then $\hat{\sigma}_{\mu}^2$ should be inversely proportional to $n$. This is easy to visualize on a log-log plot since $\log \sigma_{\mu}^2 = \log \sigma^2 - \log n$. Hence, the points should be aligned on a line of slope $-1$ in log-log coordinates.

```{code-cell} ipython3
# For concreteness, we choose as mean and variance those of our empirical sample of banana weights
mu, sigma = df['BananaWeight'].mean(), df['BananaWeight'].std()  # mean and standard deviation
# We make a note of the actual sample size
N = len(df['BananaWeight'])
# We make a note of the variance
V = sigma**2
# We select some sample sizes, growing exponentially, to verify the law of large numbers
n = [1, 10, 100, 1000, 10000]
# We select a large batch size
b = 10000
v = []
for num in n:
    mu_vec = np.zeros((b, 1))
    for i in range(b):
        # We draw a vector of size num containing fake banana weights
        x = np.random.normal(mu, sigma, num)
        # We evaluate the mean for this number of bananas (num)
        mu_vec[i, 0] = x.mean()  # We repeat this b times and fill a vector of b mu values
    # We compute for each num value the variance of the mu, over the b batches
    v = np.append(v, mu_vec.var())
```
```{code-cell} ipython3
def log_log_plt(n, v, N, V):
    # Plot a list of variance values as a function of a list of n values
    # Indicate in blue the actual value of N and the variance of mean predicted by the theory
    # Assume n[0] = 1
    plt.plot(n, v)
    plt.plot(n, v, 'ro')
    plt.plot(N, V/N, 'bo')
    ax = plt.gca()
    ax.set_xscale('log')
    ax.set_yscale('log')
    plt.xlabel('Number of examples $n$')
    plt.ylabel('Variance of mean $\\sigma_{\\mu}^2$')
    plt.text(n[0], v[0], '$\\sigma^2_{\\mu}=\\sigma^2$', fontsize=18, color='r')  # coordinate at origin
    plt.text(N, V/N, '$\\sigma^2_{\\mu}=\\sigma^2/N$', fontsize=18, color='b')  # coordinate of original N value
    plt.title('Variation of $\\sigma^2_{\\mu}$ as a function of $n$, for $\\sigma^2=$%.0f and N=%d' % (V, N))

log_log_plt(n, v, N, V)
```
We verify that the points are well aligned, with slope $-1$ in log-log scale, in accordance with $\sigma_{\mu}^2 = \sigma^2/n$.

**WARNING:** By doing that, we consider only the **statistical error**, also called **random error**, and ignore possible **systematic errors** (we will see that in the chapter on bias); we also do not use the [standard computer science definition of numerical precision](https://en.wikipedia.org/wiki/Precision_(computer_science)). Our notion of precision is that used in **[signal processing](https://en.wikipedia.org/wiki/Signal-to-noise_ratio)** and **[statistics](http://www.dspguide.com/ch2/7.htm)**. It is related to the standard deviation $\sigma$, the **signal to noise ratio** $snr = \mu / \sigma$, or its inverse the **coefficient of variation** $cv = \sigma / \mu$ (or its square, the **efficiency** $eff = (\sigma / \mu)^2$). Hence, defining the **numerical precision** $\nu$: $\nu = \log_{10} snr = - \log_{10} cv = - (1/2) \log_{10} eff$.

Now, if we use ONE-sigma bars (68\% confidence intervals for the Normal distribution) we will report results as:
Measurement = $\hat{\mu} \pm \sigma_{\mu}$
For example:
Measurement = $1.00 \pm 0.01$
In this case we will have $\hat{\mu}/\sigma_{\mu}=100$, thus $\nu=2$ and we see that $\nu=2$ corresponds to the number of decimal places. If we have:
Measurement = $123.000 \pm 0.123$
then $\hat{\mu}/\sigma_{\mu}=1000$, thus $\nu=3$, and we will call that our numerical precision. We will keep only ONE trailing non-zero digit in the error bar and write the result as:
Measurement = $123.0 \pm 0.1$
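This rounding convention is easy to automate; the sketch below (the helper name `report` is ours, not part of the text) keeps one significant digit in the error bar and rounds the mean to match:

```{code-cell} ipython3
import numpy as np

def report(mu_hat, err):
    # Decimal position of the leading (most significant) digit of the error bar
    d = -int(np.floor(np.log10(err)))
    # Round both the mean and the error bar to that decimal place
    return '%.*f +- %.*f' % (max(d, 0), round(mu_hat, d), max(d, 0), round(err, d))

print(report(123.000, 0.123))  # -> 123.0 +- 0.1
print(report(1.00, 0.01))      # -> 1.00 +- 0.01
```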
**BERNOULLI TRIALS:** This formula ($n = \frac{\sigma^2}{\hat{\mu}^2} 10^{2\nu}$, the number of examples needed to reach precision $\nu$) looks daunting, but we can simplify it in one very useful particular case: when the quantity of interest $X$ is just coin flips (e.g. 1 for an error and 0 for a correct answer), with a probability $p$ of heads and $(1-p)$ of tails. In that case, $X$ is known to be distributed according to a **[Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)**, for which $\mu=p$ and $\sigma^2=p(1-p)$. Using $\nu=1$ (one significant digit) and $\hat{\mu}$ as our estimator of $p$, if $p \ll 1$ we get the simplified formula:
$n p \simeq 100$
which is easy to remember. It means that to get "good" error bars, we need to see at least 100 errors.
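We can check this rule of thumb with a small simulation (our own sketch, with an assumed true error rate $p=0.01$): drawing $n = 100/p$ coin flips yields about 100 errors, and a standard error of roughly $p/10$, i.e. one significant digit on $p$.

```{code-cell} ipython3
import numpy as np
rng = np.random.default_rng(42)
p = 0.01               # assumed true error rate (illustrative value)
n = int(100 / p)       # rule of thumb: enough flips to see about 100 errors
x = rng.random(n) < p  # n Bernoulli trials (True = error)
p_hat = x.mean()
stderr = np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat, stderr)   # stderr is roughly p/10: one significant digit on p
```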

**GRAPHICAL REPRESENTATION:**
Let us go back to our graph in log-log coordinates and plot both:
$\sigma^2_{\mu} = \sigma^2 / n$
$\sigma^2_{\mu} = \hat{\mu}^2/10^{2\nu}$.
The intersection of the two will give us the desired numerical precision. In log coordinates, we just need to find the intersection of two lines:
$\log_{10} \sigma_{\mu}^2= \log_{10} \sigma^2 - \log_{10} n$
$\log_{10} \sigma_{\mu}^2= \log_{10} \hat{\mu}^2 - 2 \nu $.
As a numerical application, let us use our original BananaWeight example.

```{code-cell} ipython3
# (1) This part is similar to before, except that we only give the 1st and the last point of the line,
# the first one corresponding to (n=1, sigma^2_mu=sigma^2) and the last to (n=n_max, sigma^2_mu=sigma^2/n_max)
# We do not compute sigma^2_mu empirically, we just apply the formula sigma^2/n
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Find original value of n
N = len(df['BananaWeight'])
# Compute variance of the mean for the original sample number N
var_n = V/N
# Compute the original precision
nu_orig = np.round(np.log10(mu/np.sqrt(var_n)))
# Provide the original precision
print('Using N = %d examples, mean(BananaWeight) = %d +- %d (nu = %d)' % (N, np.round(mu), np.round(np.sqrt(var_n)), nu_orig))
# Compute variance of the mean for a very large value of n (n_max)
n_max = 10000
var_n_max = V/n_max
# Make the log-log plot
log_log_plt([1, n_max], [V, var_n_max], N, V)
# (2) This part is new. We want to impose a larger number of significant digits.
# We choose a precision nu and compute the number of examples n_nu necessary to get that precision
# Choose precision
nu = 3 # Three significant digits
# Get var_n_nu = mu^2 / 10^2nu
var_new_n = (mu/10**nu)**2
# Get new n value at intersection
new_n = np.round(V/var_new_n)
# Plot horizontal line at var_n_new
plt.plot([1, new_n], [var_new_n, var_new_n])
print(var_new_n, V/new_n, np.sqrt(var_new_n))
print('Using N*= %d examples, mean(BananaWeight) = %5.1f +- %.1f (nu = %d)' % (new_n, mu, np.sqrt(var_new_n), nu))
# Plot vertical bar and show tex for new N value N*
plt.plot([new_n, new_n], [var_new_n, var_n_max], "--", color='black')
plt.plot(new_n, var_new_n, 'ko')
plt.text(new_n, var_new_n, '$\sigma^2_{\mu}=\sigma^2/N*$', fontsize=18, color='k') # coordinate of new N value
plt.text(N, var_new_n, '$\sigma^2_{\mu}=\hat{\mu}^2/10^{2\\nu}$', fontsize=18, color='k') # coordinate of new N value
plt.text(new_n/1.5, var_n_max/4, '$N*$', fontsize=18, color='k') # Desired value of N*
```
**NUMBER OF EXAMPLES AS A FUNCTION OF $\nu$:** Taking the log$_{10}$ of $n = \frac{\sigma^2}{\hat{\mu}^2} 10^{2\nu}$ we get:
$\log_{10}n = 2\nu + b$,
with $b = \log_{10} \sigma^2 - 2 \log_{10} \hat{\mu}$. So, given empirical estimates $\hat{\mu}$ and $\sigma^2$ of the mean and the variance of the quantity of interest (for a given number of examples $N$), we can easily make an abacus to determine the number of examples that would be needed to obtain any desired precision $\nu$ on the estimate of the mean.

```{code-cell} ipython3
# Compute the empirical mean
mu = df['BananaWeight'].mean()
# Compute the empirical variance
V = df['BananaWeight'].var()
# Compute b
b = np.log10(V) - 2* np.log10(mu)
# Choose first and last point
min_nu = 0
max_nu = 3
min_logn = 2*min_nu +b
max_logn = 2*max_nu +b
min_n = 10**min_logn
max_n = 10**max_logn
# Plot n on a log scale
plt.plot([min_nu, max_nu], [min_n, max_n], "-", color='black')
ax = plt.gca()
ax.set_yscale('log')
plt.xlabel('Precision $\\nu$')
plt.ylabel('Number examples $n$')
# Indicate the precision of the original sample
N = len(df['BananaWeight'])
var_n = V/N
nu_orig = np.log10(mu/np.sqrt(var_n))
plt.plot(nu_orig, N, 'bo')
```
Bottom line: the number of examples increases exponentially with the precision.
+++
### Scaling the number of examples with the confidence
As indicated before, the size of the confidence interval depends on the degree of confidence we want to put into it. According to [Wikipedia](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule):
"In statistics, the **68–95–99.7 rule**, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a **normal distribution** with a width of two, four and six standard deviations, respectively; more precisely, 68.27%, 95.45% and 99.73% of the values lie within **one, two and three standard deviations of the mean**, respectively." In mathematical form, this means:
Proba$( |X-\mu| \leq \sigma ) \simeq 0.6827$
Proba$( |X-\mu| \leq 2 \sigma ) \simeq 0.9545$
Proba$( |X-\mu| \leq 3 \sigma ) \simeq 0.9973$
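These three probabilities follow from the Normal CDF: Proba$(|X-\mu| \leq k\,\sigma) = \mathrm{erf}(k/\sqrt{2})$, which we can verify directly with the Python standard library:

```{code-cell} ipython3
from math import erf, sqrt
# For a Normal distribution, Proba(|X - mu| <= k sigma) = erf(k / sqrt(2))
for k in (1, 2, 3):
    print(k, erf(k / sqrt(2)))  # -> 0.6827..., 0.9545..., 0.9973...
```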
Even though not all distributions of interest are "normal", many are nearly normal, and it is usual, instead of reporting confidence intervals as:
Proba$( |X-\mu| \leq \epsilon ) \simeq (1 - \delta)$
where $\delta$ is a small value, to only consider for $\epsilon$ multiples of $\sigma$: $1~\sigma$, $2~\sigma$, $3~\sigma$, $\dots k~\sigma$, without even bothering to report the exact corresponding value of $\delta$.
One question of interest, though, is how the number of examples needed to ensure a given confidence interval grows with $\delta$ (or, more simply, with the number $k$ in $\epsilon=k~\sigma$).
This is easily settled. Earlier on, we defined the **signal to noise ratio** as $snr = \mu / \sigma_{\mu}$. We indicated that this implicitly corresponds to a $1~\sigma$ error bar: Measurement = $\hat{\mu} \pm \sigma_{\mu}$. We now consider:
Measurement = $\hat{\mu} \pm \epsilon_{\mu} = \hat{\mu} \pm k~\sigma_{\mu}$.
We re-define (more properly) the signal to noise ratio as:
$snr = \mu / \epsilon_{\mu}$
where $\epsilon_{\mu} = k~\sigma_{\mu}$.
Thus $\sigma_{\mu}$ can be expressed as:
$\sigma_{\mu} = \mu / (k \cdot snr)$.
Applying the same definition as before of numerical precision, $\nu = \log_{10} snr$, we also get:
$\sigma_{\mu} = \mu / (k 10^\nu)$.
And now, we can again apply the formula $\sigma_{\mu} = \sigma / \sqrt{n}$ and solve for $n$ to get: