---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
# Homework 1: Your first job as a Data Scientist
```{code-cell} ipython3
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns; sns.set()
%load_ext autoreload
%autoreload 2
from platform import python_version
print(python_version()) # Check which version of Python you are running
```
You have been hired by Picoprix to evaluate fruit recognizers. See the full homework description in Chapter 1. First, we generate fake data to simulate what you collected over b=30 days, with n=100 images classified each day (50 apples and 50 bananas). We simulate the errors made by two recognizers: "Pomd'Or" and "Excel Fruit". Each row in both data tables represents a day, and each column represents an image. An entry of 1 means an error and an entry of 0 means a correct classification. We do not record whether an image showed an apple or a banana, only whether the classification was correct; presumably the samples were picked at random, in no particular order.
```{code-cell} ipython3
from scipy.stats import bernoulli # Errors are like coin flips: they follow a Bernoulli distribution
b = 30 # Number of batches (days)
n = 100 # Number of samples per day
p = 0.045 # Error rate of Pomd'Or and Excel Fruit, hé hé, the same
pom_dor = bernoulli.rvs(p, size=(b*n)).reshape((b, n))
excel_fruit = bernoulli.rvs(p, size=(b*n)).reshape((b, n))
```
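Because each entry is an independent Bernoulli draw with probability p, the daily error rate (the mean of n draws) has expected value p and standard deviation sqrt(p(1-p)/n). The sketch below is not part of the original assignment; it checks this formula numerically. The seed, the number of simulated days, and the names `rng`, `sim`, and `daily_rate` are assumptions chosen for illustration.

```python
import numpy as np

# Assumed values, matching the simulation parameters used in this notebook
n = 100      # samples per day
p = 0.045    # true error rate

# Theoretical standard deviation of a daily error rate (mean of n Bernoulli draws)
sigma_theory = np.sqrt(p * (1 - p) / n)

# Hypothetical check: simulate many days with a fixed seed for reproducibility
rng = np.random.default_rng(0)
sim = rng.binomial(1, p, size=(10_000, n))   # one row per simulated day
daily_rate = sim.mean(axis=1)                # error rate of each simulated day

print(f"theoretical sigma: {sigma_theory:.4f}")
print(f"empirical sigma:   {daily_rate.std():.4f}")
```

With only 30 days, the empirical 1-sigma error bar will fluctuate around the theoretical value more than it does in this large simulation.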
## Compute the statistics
As requested, generate Table 1.1: the average error rate, the 1-sigma error bar, and the maximum error rate on any day.
```{code-cell} ipython3
np_stat = np.zeros((2, 3))
np_mu = np.zeros((2, b)) # We will be saving the daily means for future use
for i, data in enumerate((pom_dor, excel_fruit)):
    mu = np.mean(data, axis=1) # Daily error rates, one value per day
    np_stat[i, 0] = np.mean(mu)
    np_stat[i, 1] = np.std(mu) # By default np.std uses ddof=0, i.e. it divides by b, not b-1
    np_stat[i, 2] = np.max(mu)
    np_mu[i] = mu
# Convert the numpy array np_stat to a data frame (for convenience)
df_stat = pd.DataFrame(np_stat,
                       columns = ['Error rate average', '1-sigma error bar', 'Maximum error rate any day'],
                       index = ['Pomd\'Or', 'Excel Fruit'])
```
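The question of which divisor `np.std` uses can be checked directly. The following minimal sketch compares the default (`ddof=0`) with `ddof=1` against a manual computation; the array `mu` here holds made-up values for illustration only, not the simulated data.

```python
import numpy as np

# Hypothetical daily error rates, for illustration only
mu = np.array([0.03, 0.05, 0.04, 0.06, 0.02])

std_population = np.std(mu)        # ddof=0 (default): divides by len(mu)
std_sample = np.std(mu, ddof=1)    # ddof=1: divides by len(mu) - 1

# Manual computation to confirm which divisor each call uses
sq_dev = np.sum((mu - mu.mean()) ** 2)
print(np.isclose(std_population, np.sqrt(sq_dev / len(mu))))
print(np.isclose(std_sample, np.sqrt(sq_dev / (len(mu) - 1))))
```

Note that pandas uses the opposite default: `Series.std()` and `DataFrame.std()` take `ddof=1` unless told otherwise, so the two libraries give slightly different numbers on the same data.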
```{code-cell} ipython3
df_stat
```
```{code-cell} ipython3
np_mu.shape
```
## Make histograms
First, it will be handy to create a data frame with the daily means for each recognizer.
```{code-cell} ipython3
df_mu = pd.DataFrame(np_mu.transpose(),
                     columns = ['Pomd\'Or', 'Excel Fruit'])
```
```{code-cell} ipython3
df_mu.plot.hist(alpha=0.7)
plt.xlabel('Error rate')
```
```{code-cell} ipython3
df_mu.plot.scatter(x='Pomd\'Or', y='Excel Fruit', c='DarkBlue')
epsi = 0.01
maxi = np.max(np_mu.flatten())
plt.xlim(-epsi, maxi+epsi)
plt.ylim(-epsi, maxi+epsi)
plt.gca().set_aspect('equal', adjustable='box')
```
```{code-cell} ipython3
M = np.sort(np_mu[0,:])
B = np.arange(1, b + 1) # Number of days with an error rate <= each sorted value
plt.plot(M, B)
plt.title('Cumulative distribution of Pomd\'Or')
plt.xlabel('Error rates')
plt.ylabel('Cumulative number of occurrences')
```{code-cell} ipython3
def cumulative_distribution(vect, title=''):
    """Plot the cumulative count of the values in vect."""
    M = np.sort(vect)
    B = np.arange(1, len(vect) + 1) # Number of values <= each sorted value
    plt.plot(M, B)
    plt.title('Cumulative distribution of ' + title)
    plt.xlabel('Error rates')
    plt.ylabel('Cumulative number of occurrences')
```{code-cell} ipython3
cumulative_distribution(np_mu[0,:], 'Pomd\'Or')
```
```{code-cell} ipython3
cumulative_distribution(np_mu[1,:], 'Excel Fruit')
```
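The cumulative plots above use raw counts on the vertical axis. A common variant normalizes the counts so that the curve rises from 1/n to 1, which makes recognizers evaluated over different numbers of days comparable. The sketch below shows one way to compute it; the helper name `empirical_cdf` and the sample values are hypothetical and not part of the original homework.

```python
import numpy as np

def empirical_cdf(vect):
    """Return the sorted values and the fraction of observations <= each value."""
    x = np.sort(np.asarray(vect))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical daily error rates, for illustration only
rates = [0.05, 0.02, 0.04, 0.03]
x, y = empirical_cdf(rates)
print(x)
print(y)
```

The returned arrays can be drawn as a staircase with `plt.step(x, y, where='post')`, which reflects that the empirical distribution jumps at each observed value.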