Introduction to Python Pandas¶

Pandas is an open-source Python library that provides data structures and tools for handling and analyzing tabular data. It is built on NumPy, which lets it integrate well with the rest of the data science ecosystem, including libraries such as Scikit-learn and Matplotlib.

Specifically, the key points of this library are:

Data structures: The library provides two structures for working with data. A Series is a labeled one-dimensional array, similar to a vector, list or sequence, that can hold any type of data. A DataFrame is a labeled two-dimensional structure whose columns can be of different types, similar to a spreadsheet or a SQL table.
Data manipulation: Pandas lets you carry out a thorough data analysis through functions applied directly to its data structures. These operations include handling missing data, filtering, and merging, combining and joining data from different sources.
Efficiency: Most built-in operations on these data structures are vectorized, which improves performance compared to traditional Python loops and iterators.
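
As a minimal sketch of what "vectorized" means here, the following compares a single whole-Series expression with an equivalent explicit loop. The variable names and sample values are hypothetical and only serve as illustration:

```python
import pandas as pd

prices = pd.Series([10, 20, 30])  # hypothetical sample data

# Vectorized: one expression operates on the whole Series at once
doubled_vectorized = prices * 2

# Equivalent explicit Python loop, shown only for comparison
doubled_loop = pd.Series([p * 2 for p in prices])

print(doubled_vectorized)
print(doubled_loop)
```

Both approaches produce the same values; the vectorized form is shorter and, on large data, usually much faster.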

Pandas is a fundamental tool for any developer working with data in Python, as it provides a wide variety of tools for data exploration, cleaning and transformation, making the analysis process more efficient and effective.

Data Structures in Python Pandas¶

Pandas provides two main data structures: Series and DataFrames.

Series¶

A series in Pandas is a one-dimensional labeled data structure. It is similar to a 1D array in NumPy, but has an index that allows access to the values by label. A series can contain any kind of data: integers, strings, Python objects...

Example of a series

A Pandas series has two distinct parts:

Index (index): An array of labels associated with the data.
Values (values): An array holding the data.
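
Both parts can be inspected through the index and values attributes of a Series. A short sketch, using a hypothetical series `s`:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])  # hypothetical sample data

print(s.index)   # the labels
print(s.values)  # the underlying data as a NumPy array
```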

A series can be created using the Series class of the library with a list of elements as an argument. For example:

In [1]:

import pandas as pd

serie = pd.Series([1, 2, 3, 4, 5])
serie

Out[1]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

This creates a series with elements 1, 2, 3, 4 and 5. Since we have not provided any index information, an automatic index is generated starting at 0. To define a custom index, pass it with the index argument:

In [2]:

serie = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
serie

Out[2]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

Thus, the previous series has an index composed of letters.

Both series store the same values, but the way they are accessed may vary according to the index.

In a series, its elements can be accessed by index or by position (the latter is what we did in NumPy). Below are some operations that can be performed using the above series:

In [3]:

# Access the third element
print(serie["c"]) # By index label
print(serie.iloc[2]) # By position (plain serie[2] is deprecated for label-indexed series in recent Pandas versions)

# Change the value of the second element
serie["b"] = 7
print(serie)

# Add 10 to all elements
serie += 10
print(serie)

# Calculate the sum of the elements
sum_all = serie.sum()
print(sum_all)

3
3
a    1
b    7
c    3
d    4
e    5
dtype: int64
a    11
b    17
c    13
d    14
e    15
dtype: int64
70

Pandas DataFrame¶

A DataFrame in Pandas is a two-dimensional labeled data structure. It is similar to a 2D array in NumPy, but it has a row index and a column index that allow access to the values by label.

Example of a DataFrame

A DataFrame in Pandas has several differentiated parts:

Data (data): An array of values that can be of different types per column.
Row index (row index): An array of labels associated with the rows.
Column index (column index): An array of labels associated with the columns.

A DataFrame can be seen as a set of series joined in a tabular structure: they share a common row index, and each series has its own label in the column index.
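
To make this concrete, the sketch below builds a DataFrame from two series and inspects its parts. The names `height` and `weight` and their values are hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data: two Series that share the same row labels
height = pd.Series([1.70, 1.82], index=["ana", "luis"])
weight = pd.Series([65, 80], index=["ana", "luis"])

df = pd.DataFrame({"height": height, "weight": weight})

print(df.index)            # row labels shared by every column
print(df.columns)          # one label per column
print(type(df["height"]))  # each column is itself a Series
```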

Series and DataFrames

A DataFrame can be created using the DataFrame class. For example:

In [4]:

dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataframe

Out[4]:

	0	1	2
0	1	2	3
1	4	5	6
2	7	8	9

This creates a DataFrame with three rows and three columns. As with series, a DataFrame generates automatic indexes for rows and columns if they are not passed as arguments to the constructor. To create a DataFrame with specific labels for rows and columns, we could write:

In [5]:

data = {
    "col A": [1, 2, 3],
    "col B": [4, 5, 6],
    "col C": [7, 8, 9]
}

dataframe = pd.DataFrame(data, index = ["a", "b", "c"])
dataframe

Out[5]:

	col A	col B	col C
a	1	4	7
b	2	5	8
c	3	6	9

In this way, custom labels are provided for the columns (through the dictionary keys) and for the rows (with the index argument, as was the case with series).

In a DataFrame its elements can be accessed by index or by position. Below are some operations that can be performed using the above DataFrame:

In [6]:

# Access all the data in a column
print(dataframe["col A"]) # By index
print(dataframe.loc[:,"col A"]) # By index
print(dataframe.iloc[:,0]) # By position

# Access all the data in a row
print(dataframe.loc["a"]) # By index
print(dataframe.iloc[0]) # By position

# Access a specific element (row, column)
print(dataframe.loc["a", "col A"]) # By index
print(dataframe.iloc[0, 0]) # By position

# Create a new column
dataframe["col D"] = [10, 11, 12]
print(dataframe)

# Create a new row
dataframe.loc["d"] = [13, 14, 15, 16]
print(dataframe)

# Multiply by 10 the elements of a column
dataframe["col A"] *= 10
print(dataframe)

# Calculate the sum of each column
column_sums = dataframe.sum()
print(column_sums)

a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
1
1
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
d     13     14     15     16
   col A  col B  col C  col D
a     10      4      7     10
b     20      5      8     11
c     30      6      9     12
d    130     14     15     16
col A    190
col B     29
col C     39
col D     49
dtype: int64
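
Note that sum() on a DataFrame returns one total per column rather than a single number. As a small sketch on a hypothetical two-column DataFrame, the axis argument switches to per-row totals, and chaining a second sum() gives the total of all elements:

```python
import pandas as pd

df = pd.DataFrame({"col A": [1, 2, 3], "col B": [4, 5, 6]}, index=["a", "b", "c"])

col_totals = df.sum()         # one total per column (axis=0, the default)
row_totals = df.sum(axis=1)   # one total per row
grand_total = df.sum().sum()  # a single number for the whole DataFrame

print(col_totals)
print(row_totals)
print(grand_total)
```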

Functions in Python Pandas¶

Pandas provides a large number of predefined functions that can be applied to the data structures seen above. Some of the most commonly used in data analysis are:

In [7]:

import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])

# Arithmetic Operations
print("Sum of series:", s1.add(s2))
print("Sum of DataFrames:", d1.add(d2))

# Statistical Operations
# They can be applied in the same way to DataFrames
print("Mean:", s1.mean())
print("Median:", s1.median())
print("Number of elements:", s1.count())
print("Standard deviation:", s1.std())
print("Variance:", s1.var())
print("Maximum value:", s1.max())
print("Minimum value:", s1.min())
print("Correlation:", s1.corr(s2))
print("Statistic summary:", s1.describe())

Sum of series: 0    5
1    7
2    9
dtype: int64
Sum of DataFrames:     0   1   2
0   8  10  12
1  14  16  18
Mean: 2.0
Median: 2.0
Number of elements: 3
Standard deviation: 1.0
Variance: 1.0
Maximum value: 3
Minimum value: 1
Correlation: 1.0
Statistic summary: count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
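
As the comment in the cell above notes, the same statistical methods work on DataFrames, where they are computed column by column. A minimal sketch with a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  # hypothetical sample data

print(df.mean())      # mean of each column
print(df.max())       # maximum of each column
print(df.describe())  # summary table with one column per DataFrame column
```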

Pandas allows you to use custom Python functions (including lambda)¶

In addition to the predefined Pandas functions, we can define our own and apply them to the data structures. To do this, we write a function that receives a value (or a column or row in the case of a DataFrame) and returns a modified one, and pass it to apply.

apply also accepts lambda expressions, which define the function anonymously.

The following shows how to apply functions to series:

In [8]:

import pandas as pd
s = pd.Series([1, 2, 3, 4])

# Explicit definition of the function
def squared(x):
    return x ** 2
s1 = s.apply(squared)
print(s1)

# Anonymous definition of the function
s2 = s.apply(lambda x: x ** 2)
print(s2)

0     1
1     4
2     9
3    16
dtype: int64
0     1
1     4
2     9
3    16
dtype: int64

The following shows how to apply functions to a DataFrame, which can be done by row, by column or by elements, similar to series:

In [9]:

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

# Apply function along a column
df["A"] = df["A"].apply(lambda x: x ** 2)
print(df)

# Apply function along a row
df.loc[0] = df.loc[0].apply(lambda x: x ** 2)
print(df)

# Apply function to all elements
# (on Pandas 2.1 or later, applymap is deprecated in favor of df.map, which is used the same way)
df = df.applymap(lambda x: x ** 2)
print(df)

apply is more flexible than the built-in vectorized Pandas functions, but it is usually slower, especially on large data sets. It is worth checking the Pandas or NumPy built-in functions first, as they are usually more efficient than anything we could implement ourselves.
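
For instance, the squaring done earlier with apply can also be written as a single vectorized expression. The following sketch shows both forms side by side:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

squared_apply = s.apply(lambda x: x ** 2)  # calls the lambda once per element
squared_vectorized = s ** 2                # single vectorized operation

print(squared_apply)
print(squared_vectorized)
```

The two results hold the same values; on a series this small the speed difference is negligible, but it tends to grow with the size of the data.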

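apply can also return different kinds of results, depending on the function applied and on its arguments. On a DataFrame, for example, a function that returns a single value per column or row produces a Series, while a function that returns a Series produces a DataFrame.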
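
A minimal sketch of how the result type of apply can vary on a DataFrame. The lambdas are arbitrary examples chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Function returns a scalar per column -> result is a Series indexed by column
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1 passes each row to the function -> result is a Series indexed by row
row_sums = df.apply(lambda row: row["A"] + row["B"], axis=1)

# Function returns a Series per column -> result is a DataFrame
doubled = df.apply(lambda col: col * 2)

print(col_ranges)
print(row_sums)
print(doubled)
```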

Start practicing the Pandas syntax in Python right now!¶

Click on Open in Colab to do the exercises

🛟 Solutions: In this link you can find the solutions for the following pandas exercises.

Creation of Series and Pandas DataFrames¶

Pandas Exercise 01: Create a Series from a list, a NumPy array and a dictionary (★☆☆)¶

NOTE: Review the class pd.Series (https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

In [ ]:

Pandas Exercise 02: Create a DataFrame from a NumPy array, a dictionary and a list of tuples (★☆☆)¶

NOTE: Review the class pd.DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [ ]:

Pandas Exercise 03: Create 2 Series and use them to build a DataFrame (★☆☆)¶

NOTE: Review the functions pd.concat (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and pd.Series.to_frame (https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)

In [ ]:

Filtering and updating¶

Pandas Exercise 04: Use the Series created in the previous exercise and select the positions of the elements of the first Series that are in the second Series (★★☆)¶

NOTE: Review the function pd.Series.isin (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)

In [ ]:

Pandas Exercise 05: Use the series created in exercise 03 and list the elements that are not common between both series (★★☆)¶

In [ ]:

Pandas Exercise 06: Create a DataFrame of random numbers with 5 columns and 10 rows and sort one of its columns from smallest to largest (★★☆)¶

NOTE: Review the function pd.DataFrame.sort_values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

In [ ]:

Pandas Exercise 07: Modify the name of the 5 columns of the above DataFrame to the following format: `N_column` where `N` is the column number (★★☆)¶

NOTE: Review the function pd.DataFrame.rename (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [ ]:

Pandas Exercise 08: Modify the index of the rows of the DataFrame of exercise 06 (★★☆)¶

NOTE: Review the function pd.DataFrame.rename (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [ ]: