Pandas is an open-source Python library that provides data structures and tools for handling and analyzing tabular data. It is built on NumPy, which lets it integrate well into the data science ecosystem alongside libraries such as Scikit-learn and Matplotlib.
Specifically, the library is built around two structures: the Series, a labeled one-dimensional array, similar to a vector, list or sequence, that can contain any type of data; and the DataFrame, a labeled two-dimensional structure whose columns can be of different types, similar to a spreadsheet or a SQL table.
Pandas is a fundamental tool for any developer working with data in Python, as it provides a wide variety of tools for data exploration, cleaning and transformation, making the analysis process more efficient and effective.
Pandas provides two main data structures: Series and DataFrames.
A series in Pandas is a one-dimensional labeled data structure. It is similar to a 1D array in NumPy, but has an index that allows access to the values by label. A series can contain any kind of data: integers, strings, Python objects...
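A Pandas series has two distinct parts: the index, an array of labels associated with the data, and the values, the array that holds the data itself.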
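Both parts of a series can be inspected directly. The following is a minimal sketch, assuming a small series whose labels ("x", "y", "z") and values are purely illustrative:

```python
import pandas as pd

# Illustrative series: the labels and values are not taken from the lesson
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

labels = list(s.index)   # the index: one label per element
values = s.to_numpy()    # the values: the underlying NumPy array

print(labels)
print(values)
```

The index and the values always have the same length, since each label identifies exactly one element.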
A series can be created using the Series class of the library with a list of elements as an argument. For example:
import pandas as pd
serie = pd.Series([1, 2, 3, 4, 5])
serie
This will create a series with elements 1, 2, 3, 4 and 5. In addition, since we have not included information about the indexes, an automatic index is generated starting at 0. To use custom labels instead, pass them through the index argument:
serie = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
serie
Thus, the previous series has an index composed of letters.
Both series store the same values, but the way they are accessed may vary according to the index.
The elements of a series can be accessed by index label or by position (the latter is how NumPy arrays are accessed). Below are some operations that can be performed using the above series:
# Access the third element
print(serie["c"]) # By index
print(serie.iloc[2]) # By position (iloc avoids confusing positions with labels)
# Change the value of the second element
serie["b"] = 7
print(serie)
# Add 10 to all elements
serie += 10
print(serie)
# Calculate the sum of the elements
sum_all = serie.sum()
print(sum_all)
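The difference between access by label and access by position matters most when the labels are themselves integers. The sketch below assumes an illustrative series whose integer labels deliberately do not match the positions, and uses the loc and iloc accessors to make the intent explicit:

```python
import pandas as pd

# Illustrative data: integer labels that differ from the positions
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label)
print(by_position)
```

Using loc for labels and iloc for positions keeps the code unambiguous regardless of the type of the index.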
A DataFrame in Pandas is a two-dimensional labeled data structure. It is similar to a 2D array in NumPy, but it has a row index and a column index that allow access to the values by label.
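A DataFrame in Pandas has several distinct parts: the data, an array of values that can be of a different type in each column; the row index, an array of labels associated with the rows; and the column index, an array of labels associated with the columns.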
A DataFrame can be seen as a set of series joined in a tabular structure, with a row index shared by all of them and a column label that identifies each series.
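This view of a DataFrame as a collection of series can be checked in code. The following is a minimal sketch, assuming two illustrative series (the names, values and row labels are made up) that share the same index:

```python
import pandas as pd

# Two illustrative series with a common row index
name = pd.Series(["Ana", "Luis"], index=["r1", "r2"])
age = pd.Series([31, 45], index=["r1", "r2"])

# Each dictionary key becomes a column label
df = pd.DataFrame({"name": name, "age": age})

column = df["age"]                     # selecting one column returns a Series
print(type(column).__name__)
print(column.index.equals(df.index))   # the column shares the row index
print(df.dtypes)                       # each column keeps its own type
```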
A DataFrame can be created using the DataFrame class. For example:
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataframe
This will create a DataFrame with three rows and three columns. As was the case with series, a DataFrame generates automatic indexes for rows and columns if they are not passed as arguments to the constructor of the class. To create a DataFrame with specific labels for rows and columns, we could write the following:
data = {
    "col A": [1, 2, 3],
    "col B": [4, 5, 6],
    "col C": [7, 8, 9]
}
dataframe = pd.DataFrame(data, index = ["a", "b", "c"])
dataframe
In this way, custom labels are provided for the columns (through the keys of the dictionary) and for the rows (through the index argument, as was the case with the series).
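The same labels can also be supplied when the data is given as a list of rows, through the index and columns arguments of the constructor. A minimal sketch, reusing the values of the dictionary example (note that each inner list is now a row, not a column):

```python
import pandas as pd

# Each inner list is one row; labels are passed explicitly
dataframe = pd.DataFrame(
    [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
    index=["a", "b", "c"],
    columns=["col A", "col B", "col C"],
)
print(dataframe)
```

Which form to use is mostly a matter of how the data arrives: dictionaries are convenient for column-oriented data, lists of rows for record-oriented data.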
The elements of a DataFrame can be accessed by index label or by position. Below are some operations that can be performed using the above DataFrame:
# Access all the data in a column
print(dataframe["col A"]) # By index
print(dataframe.loc[:,"col A"]) # By index
print(dataframe.iloc[:,0]) # By position
# Access all the data in a row
print(dataframe.loc["a"]) # By index
print(dataframe.iloc[0]) # By position
# Access a specific element (row, column)
print(dataframe.loc["a", "col A"]) # By index
print(dataframe.iloc[0, 0]) # By position
# Create a new column
dataframe["col D"] = [10, 11, 12]
print(dataframe)
# Create a new row
dataframe.loc["d"] = [13, 14, 15, 16]
print(dataframe)
# Multiply by 10 the elements of a column
dataframe["col A"] *= 10
print(dataframe)
# Calculate the sum of each column (sum aggregates column by column by default)
column_sums = dataframe.sum()
print(column_sums)
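Note that, on a DataFrame, sum aggregates along an axis rather than over every element at once. The following sketch, which assumes a small illustrative DataFrame, shows the totals per column, per row, and over all elements:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"col A": [1, 2, 3], "col B": [4, 5, 6]}, index=["a", "b", "c"])

column_sums = df.sum()             # axis=0 by default: one total per column
row_sums = df.sum(axis=1)          # one total per row
grand_total = df.to_numpy().sum()  # a single scalar over every element

print(column_sums)
print(row_sums)
print(grand_total)
```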
Pandas provides a large number of predefined functions that can be applied to the data structures seen above. Some of the most used in data analysis are:
import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])
# Arithmetic Operations
print("Sum of series:", s1.add(s2))
print("Sum of DataFrames:", d1.add(d2))
# Statistical Operations
# They can be applied in the same way to DataFrames
print("Mean:", s1.mean())
print("Median:", s1.median())
print("Number of elements:", s1.count())
print("Standard deviation:", s1.std())
print("Variance:", s1.var())
print("Maximum value:", s1.max())
print("Minimum value:", s1.min())
print("Correlation:", s1.corr(s2))
print("Statistical summary:", s1.describe())
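Arithmetic operations between two series match elements by index label, not by position. The sketch below assumes two illustrative series whose indexes only partially overlap, and uses the fill_value parameter of add to decide what happens to labels that appear in only one of them:

```python
import pandas as pd

# Illustrative series with partially overlapping labels
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

aligned = s1.add(s2)               # labels without a partner yield NaN
filled = s1.add(s2, fill_value=0)  # a missing partner is treated as 0

print(aligned)
print(filled)
```

In the series of the previous example both indexes are the default 0, 1, 2, so alignment by label and by position happen to coincide.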
In addition to the predefined Pandas functions, we can also define our own and apply them to the data structures. To do this, we write a function that receives a value (or a column or row in the case of a DataFrame) and returns a modified one, and pass it to apply. This method also accepts lambda expressions, which declare functions anonymously.
The following shows how to apply functions to series:
import pandas as pd
s = pd.Series([1, 2, 3, 4])
# Explicit definition of the function
def squared(x):
    return x ** 2
s1 = s.apply(squared)
print(s1)
# Anonymous definition of the function
s2 = s.apply(lambda x: x ** 2)
print(s2)
The following shows how to apply functions to a DataFrame, which can be done by row, by column or element by element, as with series:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})
# Apply function along a column
df["A"] = df["A"].apply(lambda x: x ** 2)
print(df)
# Apply function along a row
df.loc[0] = df.loc[0].apply(lambda x: x ** 2)
print(df)
# Apply function to all elements
# DataFrame.map requires pandas 2.1 or later; earlier versions use applymap
df = df.map(lambda x: x ** 2)
print(df)
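A function can also be applied to whole columns or whole rows at once by calling apply on the DataFrame itself and choosing the axis. A minimal sketch, assuming a small illustrative DataFrame:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# axis=0 (default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(col_range)
print(row_total)
```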
apply is more flexible than the vectorized Pandas functions, but it can be slower, especially on large data sets. It is worth checking the Pandas or NumPy built-in functions first, as they are usually more efficient than ones we implement ourselves.
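As an illustration, squaring the elements of a series does not need apply at all. The following sketch, with illustrative data, computes the same result both ways:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

with_apply = s.apply(lambda x: x ** 2)  # one Python-level call per element
vectorized = s ** 2                     # a single vectorized operation

print((with_apply == vectorized).all())
```

Both expressions give the same values; the vectorized form avoids calling a Python function for every element, which is why it tends to scale better.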
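Also, apply can return results of different shapes depending on the function passed and how the call is configured (for example, through the axis and result_type arguments of DataFrame.apply). On a DataFrame, a function that returns a scalar for each column produces a Series, while one that returns a Series for each column produces a DataFrame.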
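The effect of the applied function on the shape of the result can be seen in a short sketch, assuming a small illustrative DataFrame:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

reduced = df.apply(lambda col: col.sum())   # scalar per column: a Series
same_shape = df.apply(lambda col: col * 2)  # Series per column: a DataFrame

print(type(reduced).__name__)
print(type(same_shape).__name__)
```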
NOTE: Review the class pd.Series (https://pandas.pydata.org/docs/reference/api/pandas.Series.html)
NOTE: Review the class pd.DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
NOTE: Review the functions pd.concat (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and pd.Series.to_frame (https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)
NOTE: Review the function pd.Series.isin (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)
NOTE: Review the function pd.DataFrame.sort_values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
N_column where N is the column number (★★☆)
NOTE: Review the function pd.DataFrame.sort_values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
NOTE: Review the function pd.DataFrame.sort_values (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)