Homework 1: Your first job as a Data Scientist

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns; sns.set()
%load_ext autoreload
%autoreload 2
from platform import python_version
print(python_version()) # Check which version of Python you are running

You are hired by Picoprix to evaluate fruit recognizers. See the full homework description in Chapter 1. We first generate fake data to simulate the data that you collected over b=30 days, with n=100 images classified each day (50 apples and 50 bananas). We simulate the errors made by two recognizers: “Pomd’Or” and “Excel Fruit”. Each row in both data tables represents a day and each column represents an image. An entry of 1 means an error and an entry of 0 means a correct classification. We do not record whether the images were of apples or bananas, only whether the classifications were correct; presumably the samples were picked at random, in no particular order.

from scipy.stats import bernoulli # Errors are like coin flips: they follow a Bernoulli distribution
b = 30  # Number of batches (days)
n = 100 # Number of samples per day
p = 0.045 # Error rate of Pomd'Or and Excel Fruit, hé hé, the same
pom_dor = bernoulli.rvs(p, size=(b*n)).reshape((b, n))
excel_fruit = bernoulli.rvs(p, size=(b*n)).reshape((b, n))
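Because each entry is a Bernoulli draw, the daily error rate is a binomial proportion, so its standard deviation can be predicted before looking at the data: sqrt(p(1-p)/n). The sketch below compares that prediction with a fresh simulation. It uses NumPy's `default_rng` with an arbitrary seed instead of `scipy.stats.bernoulli`, only to keep the example self-contained; the names `sigma_theory` and `daily_rates` are illustrative.

```python
import numpy as np

b, n, p = 30, 100, 0.045            # Same settings as above
rng = np.random.default_rng(0)      # Seed chosen arbitrarily, for reproducibility

# Theoretical standard deviation of a daily error rate (binomial proportion)
sigma_theory = np.sqrt(p * (1 - p) / n)

# Simulate b days of n Bernoulli(p) errors and compute the daily error rates
errors = rng.binomial(1, p, size=(b, n))
daily_rates = errors.mean(axis=1)

print(f"theoretical sigma: {sigma_theory:.4f}")
print(f"empirical sigma:   {daily_rates.std(ddof=1):.4f}")
```

With only 30 days, the empirical value fluctuates around the theoretical one from run to run.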

Compute the statistics

As requested, generate Table 1.1: error rate average, 1-sigma error bar, and maximum error rate on any day.

np_stat = np.zeros((2,3))
np_mu = np.zeros((2,b)) # We will be saving the daily means for future usage
for i, data in enumerate((pom_dor, excel_fruit)):
    mu = np.mean(data, axis=1) # Daily error rates (one value per day)
    np_stat[i,0] = np.mean(mu)
    np_stat[i,1] = np.std(mu) # Check whether by default this divides by b or b-1 (see the ddof argument)
    np_stat[i,2] = np.max(mu)
    np_mu[i] = mu
# Convert the numpy array np_stat to a data frame (for convenience)
df_stat = pd.DataFrame(np_stat,
                       columns = ['Error rate average','1-sigma error bar','Maximum error rate any day'],
                       index = ['Pomd\'Or', 'Excel Fruit'])
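The comment on `np.std` matters for small samples. As a small illustration on made-up numbers (not the homework data), `np.std` divides by the number of values unless `ddof=1` is passed, in which case it divides by that number minus one.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # Made-up values, for illustration only

std_population = np.std(x)           # Default ddof=0: divides by len(x)
std_sample = np.std(x, ddof=1)       # ddof=1: divides by len(x) - 1

print(std_population)  # sqrt(1.25)
print(std_sample)      # sqrt(5/3)
```

With b=30 daily means the two estimates differ by under 2%, but the choice should still be stated alongside the table.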

Make histograms

First, it will be handy to create a data frame with the daily means for each recognizer.

df_mu = pd.DataFrame(np_mu.transpose(),  
                       columns = ['Pomd\'Or', 'Excel Fruit'])
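One way to make the two histograms comparable is to bin both recognizers' daily means on the same edges. The sketch below only computes the counts with `np.histogram`. The stand-in data `mu_demo`, the seed, and the bin width of 0.01 are assumptions made for illustration; in the notebook, the rows of `np_mu` would take the place of `mu_demo`.

```python
import numpy as np

rng = np.random.default_rng(1)   # Arbitrary seed
# Stand-in for np_mu: daily error rates of two recognizers over 30 days
mu_demo = rng.binomial(100, 0.045, size=(2, 30)) / 100

# Shared bin edges of width 0.01, centered on the attainable rates 0.00, 0.01, ...
edges = np.arange(-0.005, mu_demo.max() + 0.015, 0.01)

counts_a, _ = np.histogram(mu_demo[0], bins=edges)
counts_b, _ = np.histogram(mu_demo[1], bins=edges)

# One row per bin: bin center, count for recognizer A, count for recognizer B
for left, ca, cb in zip(edges[:-1], counts_a, counts_b):
    print(f"{left + 0.005:.2f}  {int(ca):3d}  {int(cb):3d}")
```

Since each day has n=100 images, the daily error rate can only take multiples of 0.01, so bins of that width put exactly one attainable value in each bar.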
plt.xlabel('Error rate')
df_mu.plot.scatter(x='Pomd\'Or', y='Excel Fruit', c='DarkBlue')
epsi = 0.01
maxi = np.max(np_mu.flatten())
plt.xlim(-epsi, maxi+epsi)
plt.ylim(-epsi, maxi+epsi)
plt.gca().set_aspect('equal', adjustable='box')
M = np.sort(np_mu[0,:])   # Daily error rates in increasing order
B = np.arange(1, b + 1)   # Cumulative count: 1, 2, ..., b
plt.plot(M, B)
plt.title('Cumulative distribution of Pomd\'Or')
plt.xlabel('Error rates')
plt.ylabel('Cumulative number of occurrences')
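def cumulative_distribution(vect, title=''):
    """Plot the cumulative count of the values in vect, sorted in increasing order."""
    M = np.sort(vect)
    B = np.arange(1, len(vect) + 1)   # Cumulative count: 1, 2, ..., len(vect)
    plt.plot(M, B)
    plt.title('Cumulative distribution of ' + title)
    plt.xlabel('Error rates')
    plt.ylabel('Cumulative number of occurrences')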
cumulative_distribution(np_mu[0,:], 'Pomd\'Or')
cumulative_distribution(np_mu[1,:], 'Excel Fruit')
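If the curves are to be compared across samples of different sizes, it can help to normalize the counts into fractions. A minimal sketch of such an empirical cumulative distribution function follows; the helper name `ecdf` is hypothetical and not part of the homework code.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and their cumulative fractions k/n."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)   # 1/n, 2/n, ..., 1
    return x, y

x, y = ecdf([0.03, 0.01, 0.02])
print(x)
print(y)
```

Plotting `x` against `y` with `plt.step(x, y, where='post')` gives the usual staircase shape.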