Every week, we pick a real-life project to build your portfolio and get ready for a job. All projects are built with ChatGPT as co-pilot!

Start the ChallengeA tech-culture podcast where you learn to fight the enemies that blocks your way to become a successful professional in tech.

Listen the podcastData Analysis

Data Feature Engineering

Python

Machine Learning

**Exploratory data analysis** (EDA) is the first step to solving any Machine Learning problem. It consists of a process that seeks to analyze and investigate the available data sets and summarize their main characteristics, often using data visualization techniques. This analysis is carried out through a series of steps detailed below.

In this section, we will delve into the concept by working with the Titanic dataset.

Before starting to analyze the dataset, we must understand, on the one hand, the problem or challenge we are trying to solve with this information and how suitable or useful it can be for us.

In this case, we want to analyze which people did or did not survive the sinking of the Titanic and, in successive phases, be able to train a Machine Learning model to answer the question: "What kind of people were most likely to survive?". Therefore, we find that the dataset we have available can help us solve the question posed, and we apply an EDA process to learn more about it in detail.

We will import the dataset to start working with it:

In [1]:

```
import pandas as pd
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/titanic_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/titanic_test.csv")
test_survived_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/gender_submission.csv")
test_data["Survived"] = test_survived_data["Survived"]
total_data = pd.concat([train_data, test_data]).reset_index(inplace = False)
total_data.drop(columns = ["index"], inplace = True)
total_data.head()
```

Out[1]:

Once we have loaded the dataset, we must analyze it in its entirety, without disintegration of train and test, to obtain joint conclusions. Once we have the information loaded into a manageable data structure, such as a Pandas DataFrame, we can start with the process.

Knowing the dimensions and data types of the object we are working with is vital. For this, we need the `shape`

attribute to obtain the dimensions of the object and the `info()`

function to know the typology and the amount of non-null values:

In [2]:

```
# Obtain dimensions
total_data.shape
```

Out[2]:

In [3]:

```
# Obtain information about data types and non-null values
total_data.info()
```

Once we have obtained this information, it is important that we are able to draw conclusions, such as the following:

- There are a total of 1309 rows (in this case, people) and 12 columns, among which we find the target or class to predict,
`Survived`

. - The variable
`Cabin`

only has 295 instances with values, so it would contain more than 1000 null values. The variable`Age`

also has null values, but in a much smaller number than the previous one. The rest of the variables always have a value. - The data has 7 numerical characteristics and 5 categorical characteristics.

A very important point to take into account in this step is to eliminate those instances that could be duplicated in the dataset. This is crucial since, if left, the same point would have several representations, which is mathematically incoherent and incorrect. To do this, we have to be smart about looking for duplicates and know in advance if and where there are duplicates before eliminating them. In addition, we have to take into account that an instance can be repeated independently of the identifier it may have, so in this case, we are interested in eliminating the `PassengerId`

variable from the analysis, since it could be wrongly generated.

In [4]:

```
total_data.drop("PassengerId", axis = 1).duplicated().sum()
```

Out[4]:

In this case, we did not find any duplicate values. In the case that we had found it, the next step would be to apply the `drop_duplicates()`

function.

In [5]:

```
total_data = total_data.drop_duplicates(subset = total_data.columns.difference(['PassengerId']))
print(total_data.shape)
total_data.head()
```

Out[5]:

We would again exclude the identification column, although we could repeat the analysis by including it to enrich the analysis:

In [6]:

```
if total_data.duplicated().sum():
total_data = total_data.drop_duplicates()
print(total_data.shape)
total_data.head()
```

Out[6]:

When we want to prepare the data to train a predictive model, we must answer the following question: are all the features essential to make a prediction? Normally, the answer to that question is a resounding no. We have to try to be as objective as possible and carry out this preliminary process before the feature selection phase. Therefore, here what we will try to do is a controlled elimination of those variables that we can be sure that the algorithm will not use in the predictive process; these are `PassengerId`

, `Name`

, `Ticket`

and `Cabin`

.

In [7]:

```
total_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis = 1, inplace = True)
total_data.head()
```

Out[7]:

A **univariate variable** is a statistical term used to refer to a set of observations of an attribute. That is, the column-by-column analysis of the DataFrame. To do this, we must distinguish whether a variable is categorical or numerical, as the body of the analysis and the conclusions that can be drawn will be different.

A **categorical variable** is a type of variable that can be one of a limited number of categories or groups. These groups are often nominal (e.g., the color of a car: red, blue, black, etc., but none of these colors is inherently "greater" or "better" than the others) but can also be represented by finite numbers.

To represent these types of variables we will use histograms. Before we start plotting, we must identify which ones are categorical, and this can be easily checked by analyzing the range of values. In this case, the categorical variables are `Survived`

, `Sex`

, `Pclass`

, `Embarked`

, `SibSp`

and `Parch`

:

In [8]:

```
import matplotlib.pyplot as plt
import seaborn as sns
fig, axis = plt.subplots(2, 3, figsize = (10, 7))
# Create a multiple histogram
sns.histplot(ax = axis[0, 0], data = total_data, x = "Survived").set_xlim(-0.1, 1.1)
sns.histplot(ax = axis[0, 1], data = total_data, x = "Sex").set(ylabel = None)
sns.histplot(ax = axis[0, 2], data = total_data, x = "Pclass").set(ylabel = None)
sns.histplot(ax = axis[1, 0], data = total_data, x = "Embarked")
sns.histplot(ax = axis[1, 1], data = total_data, x = "SibSp").set(ylabel = None)
sns.histplot(ax = axis[1, 2], data = total_data, x = "Parch").set(ylabel = None)
# Adjust the layout
plt.tight_layout()
# Show the plot
plt.show()
```

A histogram is a graphical representation of the distribution of a data set. It is also used to understand the frequency of the data. By looking at a histogram, we can understand if the data is skewed towards one extreme, if it is symmetrical, if it has many outliers, and so on. With the representation of each variable, we can determine that:

**Survived**: The number of people who did not survive outnumbers those who did by more than 300.**Sex**: There were almost twice as many men as women on the Titanic.**Pclass**: The sum of passengers traveling in first and second class was almost identical to those traveling in third.**Embarked**: The majority of Titanic passengers embarked at Southampton (`S`

) station.**SibSp**: More than 800 passengers traveled alone. The remainder did so with their partner or someone else from their family.**Parch**: Almost all passengers traveled without parents or children. A small portion did.

A **numeric variable** is a type of variable that can take numeric values (integers, fractions, decimals, negatives, etc.) in an infinite range. A numerical categorical variable can also be a numerical variable (e.g. for further analysis, we can take the class `Survived`

as numerical also to study relationships). They are usually represented using a histogram and a boxplot, displayed together. Before starting to plot, we must also identify which are the numerical ones, which are `Fare`

, `Age`

and `PassengerId`

. However, the latter is meaningless, so we will plot the first two:

In [9]:

```
fig, axis = plt.subplots(2, 2, figsize = (10, 7), gridspec_kw={'height_ratios': [6, 1]})
# Creating a multiple figure with histograms and box plots
sns.histplot(ax = axis[0, 0], data = total_data, x = "Fare").set(xlabel = None)
sns.boxplot(ax = axis[1, 0], data = total_data, x = "Fare")
sns.histplot(ax = axis[0, 1], data = total_data, x = "Age").set(xlabel = None, ylabel = None)
sns.boxplot(ax = axis[1, 1], data = total_data, x = "Age")
# Adjust the layout
plt.tight_layout()
# Show the plot
plt.show()
```

The combination of the two previous graphs allows us to know the distribution and its statistical characteristics. From the resulting visualization, we can see that both variables have outliers that are far from the standard distribution and that their distributions are slightly asymmetric but close to a normal distribution; the first one is totally skewed to the left, where the mean is lower than the mode, and the other one has a lower tendency.

After analyzing the characteristics one by one, it is time to analyze them in relation to the predictor and to themselves, in order to draw clearer conclusions about their relationships and to be able to make decisions about their processing.

Thus, if we would like to eliminate a variable due to a high amount of null values or certain outliers, it is necessary to first apply this process to ensure that the elimination of certain values are not critical for the survival of a passenger. For example, the variable `Cabin`

has many null values, and we would have to ensure that there is no relationship between it and survival before eliminating it, since it could be very significant and important for the model and its presence could bias the prediction.

When the two variables being compared have numerical data, the analysis is said to be numerical-numerical. Scatterplots and correlation analysis are used to compare two numerical columns.

We will use the variable `Survived`

to start with the bivariate analysis because, being a categorical variable but coded in numbers, it can be considered numerical as well. We first analyze the class versus numeric characteristics:

In [10]:

```
fig, axis = plt.subplots(2, 2, figsize = (10, 7))
# Create a multiple scatter diagram
sns.regplot(ax = axis[0, 0], data = total_data, x = "Fare", y = "Survived")
sns.heatmap(total_data[["Survived", "Fare"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 0], cbar = False)
sns.regplot(ax = axis[0, 1], data = total_data, x = "Age", y = "Survived").set(ylabel=None)
sns.heatmap(total_data[["Survived", "Age"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 1])
# Adjust the layout
plt.tight_layout()
# Show the plot
plt.show()
```

There is a direct relationship (although not very strong) between ticket price (`Fare`

) and passenger survival. Thus, some passengers with a low fare were less likely to survive than those who purchased a ticket with a higher fare. There is also a negative linear relationship, weaker than the previous one between (`Age`

) and the target variable. This makes sense considering that children were one of the groups that had a preference in using the boats for survival.

In summary, despite there being some relationship between these characteristics versus the predictor, the significance is not very high, not being decisive factors on whether a passenger survived or not.

Next, we can also relate both variables to determine their degree of affinity or correlation:

In [11]:

```
fig, axis = plt.subplots(2, 1, figsize = (5, 7))
# Create a multiple scatter diagram
sns.regplot(ax = axis[0], data = total_data, x = "Age", y = "Fare")
sns.heatmap(total_data[["Fare", "Age"]].corr(), annot = True, fmt = ".2f", ax = axis[1])
# Adjust the layout
plt.tight_layout()
# Show the plot
plt.show()
```

It can be determined that there is not a very strong relationship between the two variables and that age has no impact on whether the ticket price is higher or not.

When the two variables being compared have categorical data, the analysis is said to be categorical-categorical. Histograms and combinations are used to compare two categorical columns.

First, we analyze the class against the categorical features, one by one. Here there will be no combinations of several predictors and the class:

In [12]:

```
fig, axis = plt.subplots(2, 3, figsize = (15, 7))
sns.countplot(ax = axis[0, 0], data = total_data, x = "Sex", hue = "Survived")
sns.countplot(ax = axis[0, 1], data = total_data, x = "Pclass", hue = "Survived").set(ylabel = None)
sns.countplot(ax = axis[0, 2], data = total_data, x = "Embarked", hue = "Survived").set(ylabel = None)
sns.countplot(ax = axis[1, 0], data = total_data, x = "SibSp", hue = "Survived")
sns.countplot(ax = axis[1, 1], data = total_data, x = "Parch", hue = "Survived").set(ylabel = None)
plt.tight_layout()
fig.delaxes(axis[1, 2])
plt.show()
```

The following conclusions can be drawn from the above graph:

- With higher proportion, women survived as opposed to men. This is because women had priority over men in the evacuation plans.
- People who traveled alone had more problems surviving than those who traveled accompanied.
- Those who traveled in a better class on the Titanic had a higher chance of survival.

Multivariate analysis also allows combining the class with several predictors at the same time to enrich the analysis. These types of operations must be subjective and must combine related characteristics. For example, it would not make sense to perform an analysis between the class, the passenger's sex, and the station where they boarded the Titanic, since there is no relationship between the passenger's sex and the station. However, the class and sex of the passenger versus their survival could be an analysis worthy of study, among other cases presented below:

In [13]:

```
fig, axis = plt.subplots(figsize = (10, 5), ncols = 2)
sns.barplot(ax = axis[0], data = total_data, x = "Sex", y = "Survived", hue = "Pclass")
sns.barplot(ax = axis[1], data = total_data, x = "Embarked", y = "Survived", hue = "Pclass").set(ylabel = None)
plt.tight_layout()
plt.show()
```

From these analyses, it is clear that, women had a higher chance of survival regardless of the port of embarkation and the class in which they traveled, which reinforces the knowledge obtained earlier. Furthermore, on average, people who traveled in higher classes survived longer than those who did not.

Multivariate analysis also allows combining the class with several predictors at the same time to enrich the analysis. These types of operations must be subjective and must combine related characteristics. For example, it would not make sense to perform an analysis between the class, the passenger's sex, and the station where they boarded the Titanic, since there is no relationship between the passenger's sex and the station. However, the class and sex of the passenger versus their survival could be an analysis worthy of study, among other cases presented below:

In [14]:

```
total_data["Sex_n"] = pd.factorize(total_data["Sex"])[0]
total_data["Embarked_n"] = pd.factorize(total_data["Embarked"])[0]
fig, axis = plt.subplots(figsize = (10, 6))
sns.heatmap(total_data[["Sex_n", "Pclass", "Embarked_n", "SibSp", "Parch", "Survived"]].corr(), annot = True, fmt = ".2f")
plt.tight_layout()
plt.show()
```

The correlation analysis shows a strong direct relationship between the sex (`Sex`

) of the passenger and his or her survival, as we have seen in previous sections. In addition, there is a relationship between the number of passengers' companions (variables `SibSp`

and `Parch`

). The rest of the correlations are weak and not significant enough to be included in the analysis.

Finally, to close the multivariate study, it remains to analyze the relationship between the categorical and numerical variables.

This is the most detailed analysis we can carry out. To do this, we simply have to calculate the correlations between the variables, since this is the best indication of the relationships. Thus, once we have verified that there is a relationship, we can go deeper into the study. Another element that can be very helpful is to obtain the two-by-two relationships between all the data in the dataset. This is, in part, redundant because there are many things that we have already calculated, so it is optional.

In [15]:

```
fig, axis = plt.subplots(figsize = (10, 7))
sns.heatmap(total_data[["Age", "Fare", "Sex_n", "Pclass", "Embarked_n", "SibSp", "Parch", "Survived"]].corr(), annot = True, fmt = ".2f")
plt.tight_layout()
plt.show()
```

There is a strong negative relationship between class type (`Pclass`

) and passenger age (`Age`

) (those traveling in first class were very old) and between class and fare paid (`Fare`

), which makes a lot of sense. The rest of the correlations remain the same as previously seen.

Having analyzed the correlation, let us analyze the two cases seen to corroborate the theory:

In [16]:

```
fig, axis = plt.subplots(figsize = (10, 5), ncols = 2)
sns.regplot(ax = axis[0], data = total_data, x = "Age", y = "Pclass")
sns.regplot(ax = axis[1], data = total_data, x = "Fare", y = "Pclass").set(ylabel = None, ylim = (0.9, 3.1))
plt.tight_layout()
plt.show()
```

In the first graph, we see that as age increases, the presence of first class tickets becomes more noticeable, and as age decreases, third class tickets become more present, reinforcing the negative relationship between the observed variables. The second graph also reinforces what was observed, as better class tickets should be more expensive.

Once the correlation has been calculated, we can draw the `pairplot`

(this is an optional step):

In [17]:

```
sns.pairplot(data = total_data)
```

Out[17]:

Now, let's work and practice today's lesson to reinforce what we have learned!

**Feature engineering** is a process that involves the creation of new features (or variables) from existing ones to improve model performance. This may involve a variety of techniques, such as normalization, data transformation, and so on. The goal is to improve the accuracy of the model and/or reduce the complexity of the model, thus making it easier to interpret.

In the previous steps, we have started working with the data by eliminating duplicates, accounting for null values, and even, calculating correlations, transforming `Sex`

and `Embarked`

into numerical categories. Although this could have been done in this step as it is part of the feature engineering, it is usually done before analyzing the variables, separating this process into a previous one and the one we are going to see next.

An **outlier** is a data point that deviates significantly from the others. It is a value that is noticeably different from what would be expected given the general trend of the data. These outliers may be caused by errors in data collection, natural variations in the data, or they may be indicative of something significant, such as an anomaly or extraordinary event.

**Descriptive analysis** is a powerful tool for characterizing the data set: the mean, variance and quartiles provide powerful information about each variable. The `describe()`

function of a DataFrame helps us to calculate in a very short time all these values.

In [18]:

```
total_data.describe()
```

Out[18]:

While experience is an important component in analyzing the results in the table above, we can use certain rules to detect them, such as looking at the minimum and maximum value of a specific characteristic and comparing it to its 25% and 75% percentile. For example, everything looks normal except for the `Fare`

column which has a mean of 32.20 but its 50% percentile is 14 and its maximum value is 512. We could say that 512 seems to be an outlier, but it could be a transcription error. It is also possible that the most expensive bill had that price. It would be useful to do some research and confirm or disprove that information.

Drawing box plots of the variables also gives us very powerful information about outliers that fall outside the confidence regions:

In [19]:

```
fig, axis = plt.subplots(3, 3, figsize = (15, 10))
sns.boxplot(ax = axis[0, 0], data = total_data, y = "Survived")
sns.boxplot(ax = axis[0, 1], data = total_data, y = "Pclass")
sns.boxplot(ax = axis[0, 2], data = total_data, y = "Age")
sns.boxplot(ax = axis[1, 0], data = total_data, y = "SibSp")
sns.boxplot(ax = axis[1, 1], data = total_data, y = "Parch")
sns.boxplot(ax = axis[1, 2], data = total_data, y = "Fare")
sns.boxplot(ax = axis[2, 0], data = total_data, y = "Sex_n")
sns.boxplot(ax = axis[2, 1], data = total_data, y = "Embarked_n")
plt.tight_layout()
plt.show()
```