Data tables with Pandas

Data tables with Pandas#

Todo

Mystification

Data Manipulation#

We will learn to manipulate data, that is, a table with

attributes in columns and instances in rows in Python.

Pandas Library#

Pandas is a Python library that allows you to:

manipulate data tables, and
perform statistical analyses

as you could with a spreadsheet (Excel for example), but

programmatically, thus allowing automation.

Data tables (type DataFrame) are tables with two (or more)

dimensions, with labels (labels) on the rows and columns. The data is not necessarily

homogeneous: from one column to another, the type of data can

change (strings, floats, integers, etc.); in addition, some data

may be missing.

Series (type Series) are one-dimensional tables (vectors),

typically obtained by extracting a column from a DataFrame.

In this course, by table, we will mean a two-dimensional table of type

DataFrame, while by series we will mean a one-dimensional table of type

Series.

These concepts of DataFrame and Series are found in other libraries or

data analysis systems such as R.

In addition, Pandas allows you to process massive data distributed over very many

computers by relying on parallelism libraries such as

dask

Data Series#

We must begin by importing the pandas library; it is traditional to define a shortcut pd:

import pandas as pd

Let’s construct a series of temperatures:

temperatures = pd.Series([8.3, 10.5, 4.4, 2.9, 5.7, 11.1], name="Température")
temperatures

   8.3
  10.5
   4.4
   2.9
   5.7
  11.1
Name: Température, dtype: float64

Note that the series is considered as a column and the indices of its

rows are displayed. By default, these are the integers \(0,1,\ldots\), but other indices

are possible. As with C++ arrays or Python lists, we use the notation

‘t[i]’ to extract the row with index i:

temperatures[3]

np.float64(2.9)

The size of the series is obtained in the traditional way with Python using the

len function:

len(temperatures)

Let’s calculate the average temperature using the method mean:

temperatures.mean()

np.float64(7.1499999999999995)

Calculate the maximum temperature using the max method:

### BEGIN SOLUTION
temperatures.max()
### END SOLUTION

np.float64(11.1)

Calculate the minimum temperature:

### BEGIN SOLUTION
temperatures.min()
### END SOLUTION

np.float64(2.9)

What is the temperature data range?

### BEGIN SOLUTION
temperatures.max() - temperatures.min()
### END SOLUTION

np.float64(8.2)

Data tables (`DataFrame`)#

We will now construct a table containing the acidity data of water from

several wells. It will have two columns: one for the name of the wells and the other for the

pH value (acidity).

We can now build the table from the list of well names and

the list of well pH values:

df = pd.DataFrame({
    "Noms" : ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'],
    "pH"   : [ 7.0,  6.5,  6.8,  7.1,  8.0,  7.0,  7.1,  6.8,  7.1,  7.1 ]
})
df

	Noms	pH
0	P1	7.0
1	P2	6.5
2	P3	6.8
3	P4	7.1
4	P5	8.0
5	P6	7.0
6	P7	7.1
7	P8	6.8
8	P9	7.1
9	P10	7.1

You will notice that:

It is traditional to name df (for DataFrame) the variable containing the table. But it is desirable to use a better name whenever natural!
The first column of the table gives the index of the rows. By default, it is their number, starting at 0, in addition to the “Names” and “pH” columns.

This two-dimensional table is seen as a collection of columns. As a result,

df[label] extracts the column with the label label, in the form of a series:

df['Noms']

   P1
   P2
   P3
   P4
   P5
   P6
   P7
   P8
   P9
  P10
Name: Noms, dtype: object

You can then access each of the table’s values by specifying the label of

its column and then the index of its row. Here is the name in the second row (index 1):

df['Noms'][1]

'P2'

And the pH value in the fourth line (index 3):

df['pH'][3]

np.float64(7.1)

Again, you will notice that the access is in the form `df[colonne][ligne]` whereas

with a C++ array the access would be in the form t[ligne][colonne].

Metadata and Statistics#

We now use Pandas to extract some metadata and statistics from

our data.

First, the size of the table:

df.shape

(10, 2)

Column titles:

df.columns

Index(['Noms', 'pH'], dtype='object')

Number of lines:

len(df)

General information:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Noms    10 non-null     object 
 1   pH      10 non-null     float64
dtypes: float64(1), object(1)
memory usage: 292.0+ bytes

The average of each column for which it makes sense, that is to say only the

columns containing numerical values (here only the pH):

df.mean(numeric_only = True)

pH    7.05
dtype: float64

Standard deviations:

df.std(numeric_only = True)

pH    0.38658
dtype: float64

The median:

df.median(numeric_only = True)

pH    7.05
dtype: float64

The 25% quantile:

df.quantile(.25, numeric_only=True)

pH    6.85
Name: 0.25, dtype: float64

The minimum and maximum values:

df.min()

Noms     P1
pH      6.5
dtype: object

df.max()

Noms     P9
pH      8.0
dtype: object

A summary of the key statistics:

df.describe()

	pH
count	10.00000
mean	7.05000
std	0.38658
min	6.50000
25%	6.85000
50%	7.05000
75%	7.10000
max	8.00000

The maximum pH index:

df['pH'].idxmax()

A histogram of the pH. Note: Here, we have the option bins = 20, which means that we

divide the range of pH (the x*-axis*) into 20 groups of the same size. Change the value of

bins to 2 and see the difference:

df['pH'].hist(bins=20);

../_images/a4cb8e42b055ebb09629f8547deb413115dba50d48f68b9c7a698b655415c4bd.png

With pyplot, it is possible to refine the result by adding labels to the axes,

etc:

import matplotlib.pyplot as plt

df['pH'].plot(kind='hist')
plt.grid()
plt.ylabel('Counts')
plt.xlabel('pH');

../_images/daee8ab744cb2d0361f5cad9338fbbddeb1612374b384af22e8c35b356d914c4.png

Database operations#

Pandas allows you to perform database operations (for those who did the

Free Data project, remember “select”, “group by”, “join”, etc.). We

only show here the selection of rows; you will play with “group by” later.

Let’s transform the table so that one of the columns serves as indices; here we will

use the names:

df1 = df.set_index('Noms')
df1

	pH
Noms
P1	7.0
P2	6.5
P3	6.8
P4	7.1
P5	8.0
P6	7.0
P7	7.1
P8	6.8
P9	7.1
P10	7.1

It is now possible to access a pH value by directly using the name

as a row index:

df1['pH']['P1']

np.float64(7.0)

Let’s now select all the pH \(7.1\) lines:

df1[df1['pH'] == 7.1]

	pH
Noms
P4	7.1
P7	7.1
P9	7.1
P10	7.1

How does it work?

Let us denote pH the column with the same label:

pH = df1['pH']
pH

Noms
P1     7.0
P2     6.5
P3     6.8
P4     7.1
P5     8.0
P6     7.0
P7     7.1
P8     6.8
P9     7.1
P10    7.1
Name: pH, dtype: float64

As with NumPy, all operations are vectorized on all elements of the

array or series. Thus, if we write pH + 1 (which makes no

mathematical sense, the objects being of very different types: pH is a series while 1

is a number), this adds 1 to all the values of the series:

pH + 1

Noms
P1     8.0
P2     7.5
P3     7.8
P4     8.1
P5     9.0
P6     8.0
P7     8.1
P8     7.8
P9     8.1
P10    8.1
Name: pH, dtype: float64

Similarly, if we write pH == 7.1, this returns a series of booleans,

each indicating whether the corresponding value is equal to \(7.1\) or not:

pH == 7.1

Noms
P1     False
P2     False
P3     False
P4      True
P5     False
P6     False
P7      True
P8     False
P9      True
P10     True
Name: pH, dtype: bool

Finally, if you index an array by a series of booleans, it extracts the rows for which the series contains True:

df1[pH == 7.1]

	pH
Noms
P4	7.1
P7	7.1
P9	7.1
P10	7.1

Exercise: grades#

The Info 111 grades (anonymous!) from last year are in the CSV file notes_info_111.csv. Consult the content of this file:

You will notice that the values are separated by commas ‘,’ (CSV: Comma Separated

Value).

Here’s how to load this file as a Pandas table:

df = pd.read_csv("data/notes_info_111.csv", sep=",")

Display simple statistics in the:

### BEGIN SOLUTION
df.describe()
### END SOLUTION

	Total de Examen mi-semestre (EE) (Brut)	projet (CCTP) (Brut)	Note enseignant (Brut)	Outil externe Contrôle 1 PLaTon, jeudi 21 et vendredi 22 octobre (Brut)	Outil externe Contrôle 2 PLaTon, jeudi 9 - vendredi 10 décembre (Brut)	Total de Contrôles PLaTon (Brut)	Total de Note de TD (CCE) (Brut)	Total de Examen final (EEF) (Brut)	Total du cours (Brut)
count	276.000000	266.000000	272.000000	267.000000	267.000000	285.000000	286.000000	273.000000	286.000000
mean	13.715942	15.308271	15.978309	16.972846	17.319663	16.088035	15.631049	12.463004	13.420420
std	4.448350	4.625492	3.465690	4.306431	3.996111	5.528698	4.366991	5.763870	4.828153
min	0.600000	0.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.200000	0.000000
25%	10.400000	13.000000	14.000000	16.150000	16.560000	15.260000	14.085000	7.800000	9.997500
50%	14.000000	17.000000	16.550000	18.850000	19.060000	18.490000	16.915000	13.200000	14.395000
75%	17.800000	19.000000	18.000000	19.620000	19.690000	19.530000	18.497500	18.000000	17.537500
max	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000	20.000000

With DataGrid, you can interactively explore the table; try some of the following:

filtering ( icon);
sorting (click on the column header to sort by).

todo

verify datagrid for next year

from ipydatagrid import DataGrid
DataGrid(df)

Exercise: First names#

This exercise involves analyzing a database concerning first names

given in Paris between 2004 and 2018 (memories of S1?). This data is freely

accessible also on the

opendata site of the city of

Paris. The following shell command will download them into the

liste_des_prenoms.csv file if it is not already present.

if [ ! -f liste_des_prenoms.csv ]; then
    curl --http1.1 "https://opendata.paris.fr/explore/dataset/liste_des_prenoms/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B" -o data/liste_des_prenoms.csv
fi

Open the file to view its contents.
Using the example above as inspiration, import the file data/liste_des_prenoms.csv into a table prenoms. Note: the file uses semicolons ; as separators and not commas ,.

### BEGIN SOLUTION
prenoms = pd.read_csv('data/liste_des_prenoms.csv', sep=';')
### END SOLUTION

If the test below fails, please verify that you have correctly named your array

prenoms (without accent, with an s):

assert isinstance(prenoms, pd.DataFrame)
assert list(prenoms.columns) == ['Nombre prénoms déclarés', 'Sexe', 'Annee', 'Prenoms','Nombre total cumule par annee']

Display the first ten lines of the file. Hint: consult the documentation of the head method using prenoms.head?

### BEGIN SOLUTION
prenoms.head(10)
### END SOLUTION

	Nombre prénoms déclarés	Sexe	Annee	Prenoms	Nombre total cumule par annee
0	179	M	2024	Isaac	179
1	125	M	2024	Paul	125
2	83	F	2024	Lina	83
3	78	F	2024	Victoria	78
4	77	M	2024	Aylan	77
5	68	M	2024	Zayn	68
6	61	F	2024	Inaya	61
7	60	F	2024	Mariam	60
8	55	F	2024	Inès	55
9	48	M	2024	Orso	48

Display the lines corresponding to your first name (or another first name such as Mohammed, Maxime or even Brune if your first name is not in the list):

### BEGIN SOLUTION
prenoms[prenoms.Prenoms == "Brune"]
### END SOLUTION

	Nombre prénoms déclarés	Sexe	Annee	Prenoms	Nombre total cumule par annee
655	7	F	2006	Brune	7
712	11	F	2007	Brune	11
1037	20	F	2010	Brune	20
2785	25	F	2015	Brune	25
3601	53	F	2018	Brune	53
3771	26	F	2016	Brune	26
3869	50	F	2019	Brune	50
7527	42	F	2021	Brune	42
8048	44	F	2020	Brune	44
14110	27	F	2012	Brune	27
14246	23	F	2014	Brune	23
14918	20	F	2011	Brune	20
15955	6	F	2004	Brune	6
17181	26	F	2017	Brune	26
20948	18	F	2013	Brune	18
21774	16	F	2008	Brune	16
22038	10	F	2005	Brune	10
22830	18	F	2009	Brune	18
23317	55	F	2022	Brune	55
25488	40	F	2023	Brune	40
26008	42	F	2024	Brune	42

Do the same interactively with DataGrid:

DataGrid(prenoms)

Extract the female first names (with repetitions) into a variable prenoms_femmes, in the form of a series. Hint: see in the examples above how to select rows and how to extract a column.

### BEGIN SOLUTION
prenoms_femmes = prenoms['Prenoms'][prenoms['Sexe'] == 'F']
prenoms_femmes
### END SOLUTION

          Lina
      Victoria
         Inaya
        Mariam
          Inès
           ...   
    Saskia
     Edmée
     Enola
      Gala
     Perla
Name: Prenoms, Length: 13583, dtype: object

How many are there?

### BEGIN SOLUTION
len(prenoms_femmes)
### END SOLUTION

# Vérifie que prenoms_femmes est bien une série
assert isinstance(prenoms_femmes, pd.Series)

# Vérifie le premier prénom alphabetiquement
assert prenoms_femmes.min() == 'Aaliyah'
# Vérifie le nombre de prénoms après suppression des répétitions
assert len(set(prenoms_femmes)) == 1491

Do the same for men’s first names:

### BEGIN SOLUTION
prenoms_hommes = prenoms['Prenoms'][prenoms['Sexe']=='M']
prenoms_hommes
### END SOLUTION

       Isaac
        Paul
       Aylan
        Zayn
        Orso
          ...  
   Nahel
  Aharon
    Theo
   Marlo
   Viggo
Name: Prenoms, Length: 13095, dtype: object

### BEGIN SOLUTION
len(prenoms_hommes)
### END SOLUTION

A quick check:

assert len(prenoms_hommes) + len(prenoms_femmes) == len(prenoms)

Exercise: Transform the first names to uppercase in the table prenoms. Assign the result

to the variable prenoms_maj

Objective: Know how to search in the pandas documentation:

https://pandas.pydata.org/docs/user_guide/.

### BEGIN SOLUTION
# Solution apres avoir cherche dans "working with text" in the user guide
prenoms_maj = prenoms['Prenoms'].str.upper()
### END SOLUTION

Exercise \(\clubsuit\)

What is the most declared first name, cumulatively across all years? Assign it to the

variable prenom

Hint: consult the first example at the end of the documentation of the

groupby method to calculate the cumulative numbers by first name (with sum). Then use

DataGrid to visualize the result and sort it, or use sort_values, or

idxmax

### BEGIN SOLUTION
df = prenoms.groupby('Prenoms').sum()
# Solution 1
DataGrid(df)
# Solution 2
df.sort_values("Nombre prénoms déclarés")
### END SOLUTION

	Nombre prénoms déclarés	Sexe	Annee	Nombre total cumule par annee
Prenoms
Norhane	5	F	2013	5
Celina	5	F	2008	5
Chad	5	M	2004	5
Chaden	5	F	2015	5
Saly	5	F	2018	5
...	...	...	...	...
Arthur	5235	MMMMMMMMMMMMMMMMMMMMM	42294	5235
Louise	5373	FFFFFFFFFFFFFFFFFFFFF	42294	5373
Adam	5857	MMMMMMMMMMMMMMMMMMMMM	42294	5857
Raphaël	5887	MMMMMMMMMMMMMMMMMMMMM	42294	5887
Gabriel	7087	MMMMMMMMMMMMMMMMMMMMM	42294	7087

2823 rows × 4 columns

### BEGIN SOLUTION
# Solution 3
prenom = df['Nombre prénoms déclarés'].idxmax()
### END SOLUTION

This test verifies that you have found the correct answer; without giving it to you; the magic of hash functions :-)

import hashlib
assert hashlib.md5(prenom.encode("utf-8")).hexdigest() == 'b70e2a0d855b4dc7b1ea34a8a9d10305'

Conclusion#