Data tables with Pandas#

Todo

  • Mystification

Data Manipulation#

We will learn to manipulate data, that is, a table with

attributes in columns and instances in rows in Python.

Pandas Library#

Pandas is a Python library that allows you to:

  • manipulate data tables, and

  • perform statistical analyses

as you could with a spreadsheet (Excel for example), but

programmatically, thus allowing automation.

Data tables (type DataFrame) are tables with two (or more)

dimensions, with labels (labels) on the rows and columns. The data is not necessarily

homogeneous: from one column to another, the type of data can

change (strings, floats, integers, etc.); in addition, some data

may be missing.

Series (type Series) are one-dimensional tables (vectors),

typically obtained by extracting a column from a DataFrame.

In this course, by table, we will mean a two-dimensional table of type

DataFrame, while by series we will mean a one-dimensional table of type

Series.

These concepts of DataFrame and Series are found in other libraries or

data analysis systems such as R.

In addition, Pandas allows you to process massive data distributed over very many

computers by relying on parallelism libraries such as

dask

Data Series#

We must begin by importing the pandas library; it is traditional to define a shortcut pd:

import pandas as pd

Let’s construct a series of temperatures:

temperatures = pd.Series([8.3, 10.5, 4.4, 2.9, 5.7, 11.1], name="Température")
temperatures
0     8.3
1    10.5
2     4.4
3     2.9
4     5.7
5    11.1
Name: Température, dtype: float64

Note that the series is considered as a column and the indices of its

rows are displayed. By default, these are the integers \(0,1,\ldots\), but other indices

are possible. As with C++ arrays or Python lists, we use the notation

‘t[i]’ to extract the row with index i:

temperatures[3]
np.float64(2.9)

The size of the series is obtained in the traditional way with Python using the

len function:

len(temperatures)
6

Let’s calculate the average temperature using the method mean:

temperatures.mean()
np.float64(7.1499999999999995)

Calculate the maximum temperature using the max method:

### BEGIN SOLUTION
temperatures.max()
### END SOLUTION
np.float64(11.1)

Calculate the minimum temperature:

### BEGIN SOLUTION
temperatures.min()
### END SOLUTION
np.float64(2.9)

What is the temperature data range?

### BEGIN SOLUTION
temperatures.max() - temperatures.min()
### END SOLUTION
np.float64(8.2)

Data tables (DataFrame)#

We will now construct a table containing the acidity data of water from

several wells. It will have two columns: one for the name of the wells and the other for the

pH value (acidity).

We can now build the table from the list of well names and

the list of well pH values:

df = pd.DataFrame({
    "Noms" : ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
    "pH"   : [ 7.0,  6.5,  6.8,  7.1,  8.0,  7.0,  7.1,  6.8,  7.1,  7.1 ]
})
df
Noms pH
0 P1 7.0
1 P2 6.5
2 P3 6.8
3 P4 7.1
4 P5 8.0
5 P6 7.0
6 P7 7.1
7 P8 6.8
8 P9 7.1
9 P10 7.1
You will notice that:
  • It is traditional to name df (for DataFrame) the variable containing the table. But it is desirable to use a better name whenever natural!

  • The first column of the table gives the index of the rows. By default, it is their number, starting at 0, in addition to the “Names” and “pH” columns.

This two-dimensional table is seen as a collection of columns. As a result,

df[label] extracts the column with the label label, in the form of a series:

df['Noms']
0     P1
1     P2
2     P3
3     P4
4     P5
5     P6
6     P7
7     P8
8     P9
9    P10
Name: Noms, dtype: object

You can then access each of the table’s values by specifying the label of

its column and then the index of its row. Here is the name in the second row (index 1):

df['Noms'][1]
'P2'

And the pH value in the fourth line (index 3):

df['pH'][3]
np.float64(7.1)
Again, you will notice that the access is in the form `df[colonne][ligne]` whereas

with a C++ array the access would be in the form t[ligne][colonne].

Metadata and Statistics#

We now use Pandas to extract some metadata and statistics from

our data.

First, the size of the table:

df.shape
(10, 2)

Column titles:

df.columns
Index(['Noms', 'pH'], dtype='object')

Number of lines:

len(df)
10

General information:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Noms    10 non-null     object 
 1   pH      10 non-null     float64
dtypes: float64(1), object(1)
memory usage: 292.0+ bytes

The average of each column for which it makes sense, that is to say only the

columns containing numerical values (here only the pH):

df.mean(numeric_only = True)
pH    7.05
dtype: float64

Standard deviations:

df.std(numeric_only = True)
pH    0.38658
dtype: float64

The median:

df.median(numeric_only = True)
pH    7.05
dtype: float64

The 25% quantile:

df.quantile(.25, numeric_only=True)
pH    6.85
Name: 0.25, dtype: float64

The minimum and maximum values:

df.min()
Noms     P1
pH      6.5
dtype: object
df.max()
Noms     P9
pH      8.0
dtype: object

A summary of the key statistics:

df.describe()
pH
count 10.00000
mean 7.05000
std 0.38658
min 6.50000
25% 6.85000
50% 7.05000
75% 7.10000
max 8.00000

The maximum pH index:

df['pH'].idxmax()
4

A histogram of the pH. Note: Here, we have the option bins = 20, which means that we

divide the range of pH (the x*-axis*) into 20 groups of the same size. Change the value of

bins to 2 and see the difference:

df['pH'].hist(bins=20);
../_images/a4cb8e42b055ebb09629f8547deb413115dba50d48f68b9c7a698b655415c4bd.png

With pyplot, it is possible to refine the result by adding labels to the axes,

etc:

import matplotlib.pyplot as plt

df['pH'].plot(kind='hist')
plt.grid()
plt.ylabel('Counts')
plt.xlabel('pH');
../_images/daee8ab744cb2d0361f5cad9338fbbddeb1612374b384af22e8c35b356d914c4.png

Database operations#

Pandas allows you to perform database operations (for those who did the

Free Data project, remember “select”, “group by”, “join”, etc.). We

only show here the selection of rows; you will play with “group by” later.

Let’s transform the table so that one of the columns serves as indices; here we will

use the names:

df1 = df.set_index('Noms')
df1
pH
Noms
P1 7.0
P2 6.5
P3 6.8
P4 7.1
P5 8.0
P6 7.0
P7 7.1
P8 6.8
P9 7.1
P10 7.1

It is now possible to access a pH value by directly using the name

as a row index:

df1['pH']['P1']
np.float64(7.0)

Let’s now select all the pH \(7.1\) lines:

df1[df1['pH'] == 7.1]
pH
Noms
P4 7.1
P7 7.1
P9 7.1
P10 7.1

How does it work?

Let us denote pH the column with the same label:

pH = df1['pH']
pH
Noms
P1     7.0
P2     6.5
P3     6.8
P4     7.1
P5     8.0
P6     7.0
P7     7.1
P8     6.8
P9     7.1
P10    7.1
Name: pH, dtype: float64

As with NumPy, all operations are vectorized on all elements of the

array or series. Thus, if we write pH + 1 (which makes no

mathematical sense, the objects being of very different types: pH is a series while 1

is a number), this adds 1 to all the values of the series:

pH + 1
Noms
P1     8.0
P2     7.5
P3     7.8
P4     8.1
P5     9.0
P6     8.0
P7     8.1
P8     7.8
P9     8.1
P10    8.1
Name: pH, dtype: float64

Similarly, if we write pH == 7.1, this returns a series of booleans,

each indicating whether the corresponding value is equal to \(7.1\) or not:

pH == 7.1
Noms
P1     False
P2     False
P3     False
P4      True
P5     False
P6     False
P7      True
P8     False
P9      True
P10     True
Name: pH, dtype: bool

Finally, if you index an array by a series of booleans, it extracts the rows for which the series contains True:

df1[pH == 7.1]
pH
Noms
P4 7.1
P7 7.1
P9 7.1
P10 7.1

Exercise: grades#

The Info 111 grades (anonymous!) from last year are in the CSV file notes_info_111.csv. Consult the content of this file:

You will notice that the values are separated by commas ‘,’ (CSV: Comma Separated

Value).

Here’s how to load this file as a Pandas table:

df = pd.read_csv("data/notes_info_111.csv", sep=",")

Display simple statistics in the:

### BEGIN SOLUTION
df.describe()
### END SOLUTION
Total de Examen mi-semestre (EE) (Brut) projet (CCTP) (Brut) Note enseignant (Brut) Outil externe Contrôle 1 PLaTon, jeudi 21 et vendredi 22 octobre (Brut) Outil externe Contrôle 2 PLaTon, jeudi 9 - vendredi 10 décembre (Brut) Total de Contrôles PLaTon (Brut) Total de Note de TD (CCE) (Brut) Total de Examen final (EEF) (Brut) Total du cours (Brut)
count 276.000000 266.000000 272.000000 267.000000 267.000000 285.000000 286.000000 273.000000 286.000000
mean 13.715942 15.308271 15.978309 16.972846 17.319663 16.088035 15.631049 12.463004 13.420420
std 4.448350 4.625492 3.465690 4.306431 3.996111 5.528698 4.366991 5.763870 4.828153
min 0.600000 0.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.200000 0.000000
25% 10.400000 13.000000 14.000000 16.150000 16.560000 15.260000 14.085000 7.800000 9.997500
50% 14.000000 17.000000 16.550000 18.850000 19.060000 18.490000 16.915000 13.200000 14.395000
75% 17.800000 19.000000 18.000000 19.620000 19.690000 19.530000 18.497500 18.000000 17.537500
max 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000

With DataGrid, you can interactively explore the table; try some of the following:

  • filtering ( icon);

  • sorting (click on the column header to sort by).

todo

verify datagrid for next year

from ipydatagrid import DataGrid
DataGrid(df)

Exercise: First names#

This exercise involves analyzing a database concerning first names

given in Paris between 2004 and 2018 (memories of S1?). This data is freely

accessible also on the

opendata site of the city of

Paris. The following shell command will download them into the

liste_des_prenoms.csv file if it is not already present.

if [ ! -f liste_des_prenoms.csv ]; then
    curl --http1.1 "https://opendata.paris.fr/explore/dataset/liste_des_prenoms/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B" -o data/liste_des_prenoms.csv
fi
  1. Open the file to view its contents.

  2. Using the example above as inspiration, import the file data/liste_des_prenoms.csv into a table prenoms. Note: the file uses semicolons ; as separators and not commas ,.

### BEGIN SOLUTION
prenoms = pd.read_csv('data/liste_des_prenoms.csv', sep=';')
### END SOLUTION

If the test below fails, please verify that you have correctly named your array

prenoms (without accent, with an s):

assert isinstance(prenoms, pd.DataFrame)
assert list(prenoms.columns) == ['Nombre prénoms déclarés', 'Sexe', 'Annee', 'Prenoms','Nombre total cumule par annee']
  1. Display the first ten lines of the file. Hint: consult the documentation of the head method using prenoms.head?

### BEGIN SOLUTION
prenoms.head(10)
### END SOLUTION
Nombre prénoms déclarés Sexe Annee Prenoms Nombre total cumule par annee
0 179 M 2024 Isaac 179
1 125 M 2024 Paul 125
2 83 F 2024 Lina 83
3 78 F 2024 Victoria 78
4 77 M 2024 Aylan 77
5 68 M 2024 Zayn 68
6 61 F 2024 Inaya 61
7 60 F 2024 Mariam 60
8 55 F 2024 Inès 55
9 48 M 2024 Orso 48
  1. Display the lines corresponding to your first name (or another first name such as Mohammed, Maxime or even Brune if your first name is not in the list):

### BEGIN SOLUTION
prenoms[prenoms.Prenoms == "Brune"]
### END SOLUTION
Nombre prénoms déclarés Sexe Annee Prenoms Nombre total cumule par annee
655 7 F 2006 Brune 7
712 11 F 2007 Brune 11
1037 20 F 2010 Brune 20
2785 25 F 2015 Brune 25
3601 53 F 2018 Brune 53
3771 26 F 2016 Brune 26
3869 50 F 2019 Brune 50
7527 42 F 2021 Brune 42
8048 44 F 2020 Brune 44
14110 27 F 2012 Brune 27
14246 23 F 2014 Brune 23
14918 20 F 2011 Brune 20
15955 6 F 2004 Brune 6
17181 26 F 2017 Brune 26
20948 18 F 2013 Brune 18
21774 16 F 2008 Brune 16
22038 10 F 2005 Brune 10
22830 18 F 2009 Brune 18
23317 55 F 2022 Brune 55
25488 40 F 2023 Brune 40
26008 42 F 2024 Brune 42
  1. Do the same interactively with DataGrid:

DataGrid(prenoms)
  1. Extract the female first names (with repetitions) into a variable prenoms_femmes, in the form of a series. Hint: see in the examples above how to select rows and how to extract a column.

### BEGIN SOLUTION
prenoms_femmes = prenoms['Prenoms'][prenoms['Sexe'] == 'F']
prenoms_femmes
### END SOLUTION
2            Lina
3        Victoria
6           Inaya
7          Mariam
8            Inès
           ...   
26669      Saskia
26672       Edmée
26673       Enola
26674        Gala
26675       Perla
Name: Prenoms, Length: 13583, dtype: object

How many are there?

### BEGIN SOLUTION
len(prenoms_femmes)
### END SOLUTION
13583
# Vérifie que prenoms_femmes est bien une série
assert isinstance(prenoms_femmes, pd.Series)
# Vérifie le premier prénom alphabetiquement
assert prenoms_femmes.min() == 'Aaliyah'
# Vérifie le nombre de prénoms après suppression des répétitions
assert len(set(prenoms_femmes)) == 1491
  1. Do the same for men’s first names:

### BEGIN SOLUTION
prenoms_hommes = prenoms['Prenoms'][prenoms['Sexe']=='M']
prenoms_hommes
### END SOLUTION
0         Isaac
1          Paul
4         Aylan
5          Zayn
9          Orso
          ...  
26666     Nahel
26670    Aharon
26671      Theo
26676     Marlo
26677     Viggo
Name: Prenoms, Length: 13095, dtype: object
### BEGIN SOLUTION
len(prenoms_hommes)
### END SOLUTION
13095

A quick check:

assert len(prenoms_hommes) + len(prenoms_femmes) == len(prenoms)

Exercise: Transform the first names to uppercase in the table prenoms. Assign the result

to the variable prenoms_maj

Objective: Know how to search in the pandas documentation:

https://pandas.pydata.org/docs/user_guide/.

### BEGIN SOLUTION
# Solution apres avoir cherche dans "working with text" in the user guide
prenoms_maj = prenoms['Prenoms'].str.upper()
### END SOLUTION

Exercise \(\clubsuit\)

What is the most declared first name, cumulatively across all years? Assign it to the

variable prenom

Hint: consult the first example at the end of the documentation of the

groupby method to calculate the cumulative numbers by first name (with sum). Then use

DataGrid to visualize the result and sort it, or use sort_values, or

idxmax

### BEGIN SOLUTION
df = prenoms.groupby('Prenoms').sum()
# Solution 1
DataGrid(df)
# Solution 2
df.sort_values("Nombre prénoms déclarés")
### END SOLUTION
Nombre prénoms déclarés Sexe Annee Nombre total cumule par annee
Prenoms
Norhane 5 F 2013 5
Celina 5 F 2008 5
Chad 5 M 2004 5
Chaden 5 F 2015 5
Saly 5 F 2018 5
... ... ... ... ...
Arthur 5235 MMMMMMMMMMMMMMMMMMMMM 42294 5235
Louise 5373 FFFFFFFFFFFFFFFFFFFFF 42294 5373
Adam 5857 MMMMMMMMMMMMMMMMMMMMM 42294 5857
Raphaël 5887 MMMMMMMMMMMMMMMMMMMMM 42294 5887
Gabriel 7087 MMMMMMMMMMMMMMMMMMMMM 42294 7087

2823 rows × 4 columns

### BEGIN SOLUTION
# Solution 3
prenom = df['Nombre prénoms déclarés'].idxmax()
### END SOLUTION

This test verifies that you have found the correct answer; without giving it to you; the magic of hash functions :-)

import hashlib
assert hashlib.md5(prenom.encode("utf-8")).hexdigest() == 'b70e2a0d855b4dc7b1ea34a8a9d10305'

Conclusion#

Todo

qu’est-ce qui a été vu dans cette feuille?