Introduction to Pandas

Pandas is an open-source library that provides data structures and is designed to handle and analyze tabular data in Python. Pandas is based on NumPy, which allows it to integrate well into the data science ecosystem alongside other libraries such as Scikit-learn and Matplotlib.

Specifically, the key points of this library are:

  • Data structures: Pandas provides two structures for working with data. A Series is a labeled one-dimensional array, similar to a vector, list or sequence, that can hold any type of data. A DataFrame is a labeled two-dimensional structure whose columns can be of different types, similar to a spreadsheet or a SQL table.
  • Data manipulation: Pandas lets you carry out thorough data analysis through functions applied directly to its data structures. These operations include handling missing data, filtering, and merging, combining and joining data from different sources.
  • Efficiency: Most built-in operations on these data structures are vectorized, which improves performance compared to traditional Python loops and iterators.

Pandas is a fundamental tool for any developer working with data in Python, as it provides a wide variety of tools for data exploration, cleaning and transformation, making the analysis process more efficient and effective.
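
As a rough illustration of the data manipulation operations listed above, the sketch below fills in a missing value, filters rows and merges two tables. The table contents and the names people, cities, over_30 and merged are invented for this example and do not come from the lesson.

```python
import pandas as pd

# Hypothetical data, invented for illustration
people = pd.DataFrame({
    "name": ["Ana", "Luis", "Marta"],
    "age": [28.0, None, 35.0],
    "city_id": [1, 2, 1],
})
cities = pd.DataFrame({
    "city_id": [1, 2],
    "city": ["Madrid", "Lima"],
})

# Missing data: replace the missing age with the mean of the known ages
people["age"] = people["age"].fillna(people["age"].mean())

# Filtering: keep only the rows where age is above 30
over_30 = people[people["age"] > 30]

# Merging: combine both tables on their shared column
merged = people.merge(cities, on="city_id")

print(over_30)
print(merged)
```

Each step returns a new Series or DataFrame, so the operations can be combined one after another.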

Data Structures

Pandas provides two main data structures: Series and DataFrames.

Series

A series in Pandas is a one-dimensional labeled data structure. It is similar to a 1D array in NumPy, but has an index that allows access to the values by label. A series can contain any kind of data: integers, strings, Python objects...

A Pandas series has two distinct parts:

  • Index (index): An array of labels associated with the data.
  • Values (values): An array holding the data.

A series can be created using the Series class of the library with a list of elements as an argument. For example:

In [1]:
import pandas as pd

serie = pd.Series([1, 2, 3, 4, 5])
serie
Out[1]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

This will create a series with elements 1, 2, 3, 4 and 5. Since we have not included any information about the index, an automatic index is generated starting at 0. To use our own labels instead, we pass them with the index argument:

In [2]:
serie = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
serie
Out[2]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

Thus, the previous series has an index composed of letters.

Both series store the same values, but the way they are accessed may vary according to the index.
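
The two parts of a series described earlier can be inspected directly. The following is a minimal sketch that rebuilds both series under the illustrative names default_index and letter_index (not used elsewhere in the lesson) and compares their index and values.

```python
import pandas as pd

default_index = pd.Series([1, 2, 3, 4, 5])
letter_index = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# The index holds the labels
print(default_index.index.tolist())
print(letter_index.index.tolist())

# The values are equal in both series; only the labels differ
print(default_index.to_numpy())
print(letter_index.to_numpy())
```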

The elements of a series can be accessed by label (through the index) or by position, as we did in NumPy. Below are some operations that can be performed using the above series:

In [3]:
# Access the third element
print(serie["c"]) # By index
print(serie.iloc[2]) # By position

# Change the value of the second element
serie["b"] = 7
print(serie)

# Add 10 to all elements
serie += 10
print(serie)

# Calculate the sum of the elements
sum_all = serie.sum()
print(sum_all)
3
3
a    1
b    7
c    3
d    4
e    5
dtype: int64
a    11
b    17
c    13
d    14
e    15
dtype: int64
70
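
The difference between access by label and access by position matters most when the labels are themselves integers. The sketch below uses an invented series whose integer labels do not match the positions, and relies on loc for labels and iloc for positions to avoid ambiguity.

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[2]      # label 2 is attached to the first element
by_position = s.iloc[2]  # position 2 is the third element

print(by_label)
print(by_position)
```

Using loc and iloc explicitly makes the intent clear regardless of the type of the index.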

DataFrame

A DataFrame in Pandas is a two-dimensional labeled data structure. It is similar to a 2D array in NumPy, but has row and column indexes that allow access to the values by label.

A DataFrame in Pandas has several differentiated parts:

  • Data (data): An array of values that can be of a different type in each column.
  • Row index (row index): An array of labels associated with the rows.
  • Column index (column index): An array of labels associated with the columns.

A DataFrame can be seen as a set of series joined in a tabular structure: the series share a common row index, and each one is identified by its own column label.

A DataFrame can be created using the DataFrame class. For example:

In [4]:
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataframe
Out[4]:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

This will create a DataFrame with three rows and three columns. As was the case with series, a DataFrame generates automatic indexes for rows and columns if they are not passed as arguments to the constructor. To create a DataFrame with specific labels for rows and columns, we can write the following:

In [5]:
data = {
    "col A": [1, 2, 3],
    "col B": [4, 5, 6],
    "col C": [7, 8, 9]
}

dataframe = pd.DataFrame(data, index = ["a", "b", "c"])
dataframe
Out[5]:
   col A  col B  col C
a      1      4      7
b      2      5      8
c      3      6      9

In this way, custom labels are provided for the columns (the dictionary keys) and for the rows (with the index argument, as was the case with the series).
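
The parts of a DataFrame can be checked on this same example. The sketch below rebuilds the DataFrame, prints its row labels, column labels and shape, and shows that a single column is a series that shares the row index. The variable name column is only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"col A": [1, 2, 3], "col B": [4, 5, 6], "col C": [7, 8, 9]},
    index=["a", "b", "c"],
)

print(dataframe.index.tolist())    # row labels
print(dataframe.columns.tolist())  # column labels
print(dataframe.shape)             # (number of rows, number of columns)

# Each column is a Series that shares the row index of the DataFrame
column = dataframe["col B"]
print(type(column).__name__)
print(column.index.equals(dataframe.index))
```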

In a DataFrame its elements can be accessed by index or by position. Below are some operations that can be performed using the above DataFrame:

In [6]:
# Access all the data in a column
print(dataframe["col A"]) # By index
print(dataframe.loc[:,"col A"]) # By index
print(dataframe.iloc[:,0]) # By position

# Access all the data in a row
print(dataframe.loc["a"]) # By index
print(dataframe.iloc[0]) # By position

# Access to a specific element (row, column)
print(dataframe.loc["a", "col A"]) # By index
print(dataframe.iloc[0, 0]) # By position

# Create a new column
dataframe["col D"] = [10, 11, 12]
print(dataframe)

# Create a new row
dataframe.loc["d"] = [13, 14, 15, 16]
print(dataframe)

# Multiply by 10 the elements of a column
dataframe["col A"] *= 10
print(dataframe)

# Calculate the sum of each column
sum_all = dataframe.sum()
print(sum_all)
a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
1
1
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
d     13     14     15     16
   col A  col B  col C  col D
a     10      4      7     10
b     20      5      8     11
c     30      6      9     12
d    130     14     15     16
col A    190
col B     29
col C     39
col D     49
dtype: int64
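
Note that sum on a DataFrame returns one total per column rather than a single number. As a small sketch on the original three-column DataFrame, the axis argument switches to one total per row, and summing the underlying NumPy array gives a single grand total. The names per_column, per_row and grand_total are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"col A": [1, 2, 3], "col B": [4, 5, 6], "col C": [7, 8, 9]},
    index=["a", "b", "c"],
)

per_column = dataframe.sum()              # one total per column (axis=0, the default)
per_row = dataframe.sum(axis=1)           # one total per row
grand_total = dataframe.to_numpy().sum()  # a single number for the whole table

print(per_column)
print(per_row)
print(grand_total)
```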

Functions

Pandas provides a large number of predefined functions that can be applied to the data structures seen above. Some of the most used in data analysis are:

In [7]:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])

# Arithmetic Operations
print("Sum of series:", s1.add(s2))
print("Sum of DataFrames:", d1.add(d2))

# Statistical Operations
# They can be applied in the same way to DataFrames
print("Mean:", s1.mean())
print("Median:", s1.median())
print("Number of elements:", s1.count())
print("Standard deviation:", s1.std())
print("Variance:", s1.var())
print("Maximum value:", s1.max())
print("Minimum value:", s1.min())
print("Correlation:", s1.corr(s2))
print("Statistic summary:", s1.describe())
Sum of series: 0    5
1    7
2    9
dtype: int64
Sum of DataFrames:     0   1   2
0   8  10  12
1  14  16  18
Mean: 2.0
Median: 2.0
Number of elements: 3
Standard deviation: 1.0
Variance: 1.0
Maximum value: 3
Minimum value: 1
Correlation: 1.0
Statistic summary: count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
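
The series in the example above share the same default index, so add pairs their elements one to one. Arithmetic between series is matched by label, not by position. The sketch below, with invented labels and values, shows what happens when the indexes only partly overlap and how the fill_value argument of add changes the result.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one of the series produce NaN
plain = s1.add(s2)

# fill_value treats the missing side as 0 instead
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```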

Custom Functions

In addition to the Pandas predefined functions, we can also define our own and apply them to the data structures. To do this, we write a function that receives a value (or a column or row in the case of a DataFrame) and returns a modified one, and pass it to apply.

apply also accepts lambda expressions, which let us declare the function anonymously.

The following shows how to apply functions to series:

In [8]:
import pandas as pd
s = pd.Series([1, 2, 3, 4])

# Explicit definition of the function
def squared(x):
    return x ** 2
s1 = s.apply(squared)
print(s1)

# Anonymous definition of the function
s2 = s.apply(lambda x: x ** 2)
print(s2)
0     1
1     4
2     9
3    16
dtype: int64
0     1
1     4
2     9
3    16
dtype: int64

The following shows how to apply functions to a DataFrame, which can be done by row, by column or by elements, similar to series:

In [9]:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

# Apply function along a column
df["A"] = df["A"].apply(lambda x: x ** 2)
print(df)

# Apply function along a row
df.loc[0] = df.loc[0].apply(lambda x: x ** 2)
print(df)

# Apply function to all elements
# (from pandas 2.1 on, applymap is deprecated in favor of the equivalent DataFrame.map)
df = df.applymap(lambda x: x ** 2)
print(df)
   A  B
0  1  4
1  4  5
2  9  6
   A   B
0  1  16
1  4   5
2  9   6
    A    B
0   1  256
1  16   25
2  81   36

apply is more flexible than the vectorized Pandas functions, but it can be slower, especially on large data sets, because it calls the function once for each element, row or column. It is worth checking the Pandas or NumPy built-in functions first, as they are usually more efficient than the ones we could implement ourselves.
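
As a minimal sketch of this advice, the squaring example from above can be written without apply. The three variants below give the same result; the names with_apply, vectorized and with_numpy are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])

with_apply = s.apply(lambda x: x ** 2)  # calls the lambda once per element
vectorized = s ** 2                     # one operation on the whole series
with_numpy = np.square(s)               # NumPy functions also accept a Series

print(with_apply.equals(vectorized))
print(vectorized.equals(with_numpy))
```

On a series this small the difference in speed is negligible; it tends to become noticeable as the data grows.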

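Also, the type of result that apply returns depends on the function applied and on its arguments. For example, on a DataFrame, a function that returns a single value per column produces a series, while a function that returns a series per column produces a DataFrame.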
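
The following sketch illustrates how the result of apply on a DataFrame changes with the function and with the axis argument. The lambdas and the names col_range, row_total and centered are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# The function returns one number per column, so the result is a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1 passes each row to the function instead of each column
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

# The function returns a Series per column, so the result is a DataFrame
centered = df.apply(lambda col: col - col.mean())

print(col_range)
print(row_total)
print(centered)
```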

Exercises

Click on Open in Colab to do the exercises

Solution: https://github.com/4GeeksAcademy/machine-learning-prework/blob/main/03-pandas/03.1-Intro-to-Pandas_solutions.ipynb

Creation of Series and DataFrames

Exercise 01: Create a Series from a list, a NumPy array and a dictionary (★☆☆)

NOTE: Review the class pd.Series (https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

In [ ]:

Exercise 02: Create a DataFrame from a NumPy array, a dictionary and a list of tuples (★☆☆)

NOTE: Review the class pd.DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [ ]:

Exercise 03: Create 2 Series and use them to build a DataFrame (★☆☆)

NOTE: Review the functions pd.concat (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and pd.Series.to_frame (https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)

In [ ]:

Filtering and updating

Exercise 04: Use the Series created in the previous exercise and select the positions of the elements of the first Series that are in the second Series (★★☆)

NOTE: Review the function pd.Series.isin (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)

In [ ]:

Exercise 05: Use the series created in exercise 03 and list the elements that are not common between both series (★★☆)

In [ ]:

Exercise 06: Create a DataFrame of random numbers with 5 columns and 10 rows and sort one of its columns from smallest to largest (★★☆)

NOTE: Review the function pd.DataFrame.sort_values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

In [ ]:

Exercise 07: Modify the name of the 5 columns of the above DataFrame to the following format: N_column where N is the column number (★★☆)

NOTE: Review the function pd.DataFrame.rename (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [ ]:

Exercise 08: Modify the index of the rows of the DataFrame of exercise 06 (★★☆)

NOTE: Review the function pd.DataFrame.rename (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [ ]: