**Pandas** is an open-source Python library that provides data structures and tools designed to handle and analyze tabular data. It is built on NumPy, which allows it to integrate well into the data science ecosystem alongside other libraries such as `Scikit-learn` and `Matplotlib`.

Specifically, the key points of this library are:

- **Data structures**: The library provides two structures for working with data: the `Series`, a labeled one-dimensional array, similar to a vector, list or sequence, that can contain any type of data; and the `DataFrame`, a labeled two-dimensional structure with columns that can be of different types, similar to a spreadsheet or a SQL table.
- **Data manipulation**: Pandas lets you carry out exhaustive data analysis through functions applied directly to its data structures. These operations include handling missing data, filtering, and merging, combining and joining data from different sources.
- **Efficiency**: Most built-in operations on these data structures are vectorized, which improves performance compared to traditional Python loops and iterators.
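As a rough illustration of the efficiency point above, the following sketch computes the same result with a Python loop and with a single vectorized expression. The sample values are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
values = pd.Series([1, 2, 3, 4])

# Loop-based approach: each element is processed one at a time in Python
doubled_loop = pd.Series([v * 2 for v in values])

# Vectorized approach: one expression applied to the whole series at once
doubled_vectorized = values * 2

# Both approaches produce the same series
print(doubled_vectorized.equals(doubled_loop))
```

On small inputs the difference is negligible; the vectorized form tends to pay off as the data grows.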

Pandas is a fundamental tool for any developer working with data in Python, as it provides a wide variety of tools for data exploration, cleaning and transformation, making the analysis process more efficient and effective.

Pandas provides two main data structures: `Series` and `DataFrame`.

A **series** in Pandas is a one-dimensional labeled data structure. It is similar to a 1D array in NumPy, but has an index that allows access to the values by label. A series can contain any kind of data: integers, strings, Python objects...

A Pandas series has two distinct parts:

- **Index** (*index*): An array of labels associated with the data.
- **Value** (*value*): An array of data.
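The two parts can be inspected separately through the `index` and `values` attributes. This is a minimal sketch; the series `s` and its labels are made up for the example.

```python
import pandas as pd

# Hypothetical series with explicit labels
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(s.index)   # the labels
print(s.values)  # the underlying data as a NumPy array

# Both parts can also be converted to plain Python lists
labels = s.index.tolist()
data = s.to_list()
```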

A series can be created using the `Series` class of the library with a list of elements as an argument. For example:

In [1]:

```
import pandas as pd
serie = pd.Series([1, 2, 3, 4, 5])
serie
```

This creates a series with the elements 1, 2, 3, 4 and 5. Since we have not provided any index information, an automatic index starting at 0 is generated. To use custom labels instead, pass them through the `index` argument:

In [2]:

```
serie = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
serie
```

Thus, the previous series has an index composed of letters.

Both series store the same values, but the way they are accessed may vary according to the index.

In a series, its elements can be accessed by index or by position (the latter is what we did in NumPy). Below are some operations that can be performed using the above series:

In [3]:

```
# Access the third element
print(serie["c"])     # By index (label)
print(serie.iloc[2])  # By position (integer keys on a labeled index are deprecated in recent pandas)

# Change the value of the second element
serie["b"] = 7
print(serie)

# Add 10 to all elements
serie += 10
print(serie)

# Calculate the sum of the elements
sum_all = serie.sum()
print(sum_all)
```
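The difference between access by label and access by position matters most when the labels are themselves integers. The sketch below uses a hypothetical series whose labels do not match the positions, and relies on `loc` (label-based) and `iloc` (position-based) to keep the intent explicit.

```python
import pandas as pd

# Hypothetical series whose integer labels do not match the positions
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

first_by_label = s.loc[10]     # label-based access
first_by_position = s.iloc[0]  # position-based access
last = s.iloc[-1]              # negative positions count from the end

print(first_by_label, first_by_position, last)
```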

A **DataFrame** in Pandas is a two-dimensional labeled data structure. It is similar to a 2D array in NumPy, but has row and column indexes that allow access to the values by label.

A DataFrame in Pandas has several differentiated parts:

- **Data** (*data*): An array of values that can be of different types per column.
- **Row index** (*row index*): An array of labels associated with the rows.
- **Column index** (*column index*): An array of labels associated with the columns.

A DataFrame can be seen as a set of series joined in a tabular structure: all of them share a common row index, and each one is identified by its own label in the column index.
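To make this view concrete, the following sketch builds a DataFrame from two series and then pulls one column back out. The names `ages`, `heights` and `people`, and their data, are hypothetical.

```python
import pandas as pd

# Two hypothetical series that share the same row labels
ages = pd.Series([25, 32, 47], index=["ana", "ben", "carla"])
heights = pd.Series([1.65, 1.80, 1.72], index=["ana", "ben", "carla"])

# Each dictionary key becomes a column label
people = pd.DataFrame({"age": ages, "height": heights})

# Each column is itself a series that shares the DataFrame's row index
age_column = people["age"]
print(type(age_column).__name__)
print(age_column.index.equals(people.index))
```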

A DataFrame can be created using the `DataFrame` class. For example:

In [4]:

```
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataframe
```

This creates a DataFrame with three rows and three columns. As with series, a DataFrame generates automatic indexes for rows and columns if they are not passed as arguments to the class constructor. To create a new DataFrame with specific indexes for rows and columns, we could write:

In [5]:

```
data = {
    "col A": [1, 2, 3],
    "col B": [4, 5, 6],
    "col C": [7, 8, 9]
}
dataframe = pd.DataFrame(data, index = ["a", "b", "c"])
dataframe
```

In this way, a custom index is provided for the columns (through the keys of the dictionary) and for the rows (with the `index` argument, as was the case with the series).
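Both indexes can be read back from the DataFrame through its `index` and `columns` attributes. This is a small sketch with a hypothetical two-column DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with custom row and column labels
df = pd.DataFrame(
    {"col A": [1, 2, 3], "col B": [4, 5, 6]},
    index=["a", "b", "c"],
)

row_labels = df.index.tolist()       # labels of the rows
column_labels = df.columns.tolist()  # labels of the columns
shape = df.shape                     # (number of rows, number of columns)

print(row_labels, column_labels, shape)
```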

In a DataFrame, elements can be accessed by index or by position. Below are some operations that can be performed using the above DataFrame:

In [6]:

```
# Access all the data in a column
print(dataframe["col A"])         # By index
print(dataframe.loc[:, "col A"])  # By index
print(dataframe.iloc[:, 0])       # By position

# Access all the data in a row
print(dataframe.loc["a"])  # By index
print(dataframe.iloc[0])   # By position

# Access a specific element (row, column)
print(dataframe.loc["a", "col A"])  # By index
print(dataframe.iloc[0, 0])         # By position

# Create a new column
dataframe["col D"] = [10, 11, 12]
print(dataframe)

# Create a new row
dataframe.loc["d"] = [13, 14, 15, 16]
print(dataframe)

# Multiply by 10 the elements of a column
dataframe["col A"] *= 10
print(dataframe)

# Calculate the sum of each column
sum_columns = dataframe.sum()
print(sum_columns)

# Calculate the sum of all elements
sum_all = dataframe.sum().sum()
print(sum_all)
```
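Note that `sum` on a DataFrame aggregates along an axis and does not collapse everything into one number. The sketch below, on a hypothetical DataFrame, contrasts the three common cases.

```python
import pandas as pd

# Hypothetical DataFrame used only to illustrate the axis argument
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

column_sums = df.sum()     # one value per column (axis=0 is the default)
row_sums = df.sum(axis=1)  # one value per row
total = df.sum().sum()     # a single value for the whole DataFrame

print(column_sums)
print(row_sums)
print(total)
```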

Pandas provides a large number of predefined functions that can be applied to the data structures seen above. Some of the most used in data analysis are:

In [7]:

```
import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])
# Arithmetic Operations
print("Sum of series:", s1.add(s2))
print("Sum of DataFrames:", d1.add(d2))
# Statistical Operations
# They can be applied in the same way to DataFrames
print("Mean:", s1.mean())
print("Median:", s1.median())
print("Number of elements:", s1.count())
print("Standard deviation:", s1.std())
print("Variance:", s1.var())
print("Maximum value:", s1.max())
print("Minimum value:", s1.min())
print("Correlation:", s1.corr(s2))
print("Statistic summary:", s1.describe())
```
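As the comment in the cell above mentions, the statistical operations also work on DataFrames, where they are computed per column by default. A minimal sketch, assuming a hypothetical `scores` DataFrame:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
scores = pd.DataFrame({"math": [6, 8, 10], "art": [9, 7, 5]})

column_means = scores.mean()     # mean of each column
row_means = scores.mean(axis=1)  # mean of each row
summary = scores.describe()      # count, mean, std, min, quartiles and max per column

print(column_means)
print(row_means)
print(summary)
```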

In addition to the Pandas predefined functions, we can define our own and apply them to the data structures. To do this, we write a function that receives a value (or a column or row in the case of a DataFrame) and returns a modified one, and pass it to `apply`.

`apply` also accepts **lambda expressions**, which declare anonymous functions inline.

The following shows how to apply functions to series:

In [8]:

```
import pandas as pd
s = pd.Series([1, 2, 3, 4])
# Explicit definition of the function
def squared(x):
    return x ** 2
s1 = s.apply(squared)
print(s1)
# Anonymous definition of the function
s2 = s.apply(lambda x: x ** 2)
print(s2)
```

The following shows how to apply functions to a DataFrame, which can be done by row, by column or by elements, similar to series:

In [9]:

```
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

# Apply function to each element of a column
df["A"] = df["A"].apply(lambda x: x ** 2)
print(df)

# Apply function to each element of a row
df.loc[0] = df.loc[0].apply(lambda x: x ** 2)
print(df)

# Apply function to all elements
# (from pandas 2.1 on, applymap is deprecated in favor of DataFrame.map)
df = df.applymap(lambda x: x ** 2)
print(df)
```
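`DataFrame.apply` can also hand a whole column or a whole row to the function, depending on its `axis` argument. The following is a sketch under that assumption, with a hypothetical DataFrame and made-up variable names.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# axis=0 (the default): the function receives each column as a series
column_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a series
row_totals = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(column_ranges)
print(row_totals)
```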

`apply` is more flexible than the built-in vectorized Pandas functions, but can be slower, especially on large data sets, because it calls a Python function repeatedly. It is worth looking for a Pandas or NumPy built-in function first, as these are usually more efficient than the ones we could implement ourselves.

The type of the result also depends on the function applied: on a DataFrame, for example, a function that reduces each column or row to a single value produces a series, while one that returns a series of the same length produces a DataFrame.
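The sketch below illustrates both remarks on hypothetical data: a built-in vectorized expression that gives the same result as `apply`, and two calls to `DataFrame.apply` that return different types.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

squared_apply = s.apply(lambda x: x ** 2)  # calls the function once per element
squared_vectorized = s ** 2                # one vectorized operation
print(squared_apply.equals(squared_vectorized))

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# The function returns one value per column, so the result is a series
reduced = df.apply(lambda col: col.sum())

# The function returns a series per column, so the result is a DataFrame
transformed = df.apply(lambda col: col * 2)

print(type(reduced).__name__, type(transformed).__name__)
```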

Solutions: In this link you can find the solutions for the following pandas exercises.

NOTE: Review the class `pd.Series` (https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

NOTE: Review the class `pd.DataFrame` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

NOTE: Review the functions `pd.concat` (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and `pd.Series.to_frame` (https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)

NOTE: Review the function `pd.Series.isin` (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)

NOTE: Review the function `pd.DataFrame.sort_values` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

`N_column` where `N` is the column number (★★☆)

NOTE: Review the function `pd.DataFrame.sort_values` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

NOTE: Review the function `pd.DataFrame.sort_values` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
