

Exploratory data analysis

Exploratory data analysis (EDA) is the first step in solving any Machine Learning problem. It is a process that analyzes and investigates the available data sets and summarizes their main characteristics, often using data visualization techniques. This analysis is carried out through the series of steps detailed below.

In this section we will delve into the concept by working with the Titanic dataset.

DAY 1

Step 1: Problem statement and data collection

Before starting to analyze the dataset, we must understand the problem or challenge we are trying to solve with this information, and how suitable or useful the data can be for that purpose.

In this case, we want to analyze which people did or did not survive the sinking of the Titanic and, in successive phases, be able to train a Machine Learning model to answer the question: "What kind of people were most likely to survive?". Therefore, we find that the dataset we have available can help us to solve the question posed and we apply an EDA process to learn more about it in detail.

We will import the dataset to start working with it:

In [1]:
import pandas as pd

train_data = pd.read_csv("/workspaces/machine-learning-content/assets/titanic_train.csv")
test_data = pd.read_csv("/workspaces/machine-learning-content/assets/titanic_test.csv")
test_survived_data = pd.read_csv("/workspaces/machine-learning-content/assets/gender_submission.csv")
test_data["Survived"] = test_survived_data["Survived"]

total_data = pd.concat([train_data, test_data]).reset_index(drop = True)
total_data.head()
Out[1]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

Step 2: Exploration and data cleaning

Once we have loaded the dataset, we must analyze it in its entirety, without separating train and test, in order to draw joint conclusions. With the information loaded into a manageable data structure such as a Pandas DataFrame, we can start the process.

Knowing the dimensions and data types of the object we are working with is vital. For this, we use the shape attribute to obtain the dimensions of the object and the info() function to learn the data types and the number of non-null values:

In [2]:
# Obtain dimensions
total_data.shape
Out[2]:
(1309, 12)
In [3]:
# Obtain information about data types and non-null values
total_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     1309 non-null   int64  
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(2), int64(5), object(5)
memory usage: 122.8+ KB

Once we have obtained this information, it is important that we are able to draw conclusions, such as the following:

  • There are a total of 1309 rows (in this case, people) and 12 columns, among which we find the target or class to predict, Survived.
  • The variable Cabin has only 295 non-null instances, so it contains more than 1000 null values. The variable Age also has null values, but far fewer than the previous one. The rest of the variables always have a value.
  • The data has 7 numerical characteristics and 5 categorical characteristics.
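These null counts can also be obtained directly, without reading them off info(). A minimal sketch on a hypothetical toy frame (on the real data the call would simply be total_data.isnull().sum()):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for total_data (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10]
})

# Missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending = False)
print(missing)
```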

Eliminate duplicates

A very important point in this step is to eliminate instances that may be duplicated in the dataset. This is crucial because, if left in, the same point would have several representations, which is statistically incoherent and incorrect. Before eliminating anything, we should check whether and where duplicates exist. We must also take into account that an instance can be repeated regardless of its identifier, so here we exclude the PassengerId variable from the check, since it could have been generated incorrectly.

In [4]:
total_data.drop("PassengerId", axis = 1).duplicated().sum()
Out[4]:
0

In this case, we did not find any duplicate rows. Had we found any, the next step would be to apply the drop_duplicates() function:

In [5]:
total_data = total_data.drop_duplicates(subset = total_data.columns.difference(['PassengerId']))
print(total_data.shape)
total_data.head()
(1309, 12)
Out[5]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

We can repeat the check, this time including the identification column, to enrich the analysis:

In [6]:
if total_data.duplicated().sum():
    total_data = total_data.drop_duplicates()
print(total_data.shape)
total_data.head()
(1309, 12)
Out[6]:
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

Eliminate irrelevant information

When we want to prepare the data to train a predictive model, we must answer the following question: are all the features essential to make a prediction? Normally, the answer is a resounding no. We have to try to be as objective as possible and carry out this preliminary pruning before the feature selection phase. Here, therefore, we carry out a controlled elimination of variables that we can be sure the algorithm will not use in the predictive process: PassengerId, Name, Ticket and Cabin.

In [7]:
total_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis = 1, inplace = True)
total_data.head()
Out[7]:
|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |

Step 3: Analysis of univariate variables

Univariate analysis is a statistical term referring to the study of a single attribute: that is, the column-by-column analysis of the DataFrame. To do this, we must distinguish whether a variable is categorical or numerical, as the analysis and the conclusions that can be drawn will be different.

Analysis on categorical variables

A categorical variable is a type of variable that can take one of a limited number of categories or groups. These groups are often nominal (e.g., the color of a car: red, blue, black, etc., but none of these colors is inherently "greater" or "better" than the others) but can also be represented by finite numbers.

To represent these types of variables we will use histograms. Before we start plotting, we must identify which ones are categorical, and this can be easily checked by analyzing the range of values. In this case, the categorical variables are Survived, Sex, Pclass, Embarked, SibSp and Parch:

In [8]:
import matplotlib.pyplot as plt 
import seaborn as sns

fig, axis = plt.subplots(2, 3, figsize = (10, 7))

# Create a multiple histogram
sns.histplot(ax = axis[0, 0], data = total_data, x = "Survived").set_xlim(-0.1, 1.1)
sns.histplot(ax = axis[0, 1], data = total_data, x = "Sex").set(ylabel = None)
sns.histplot(ax = axis[0, 2], data = total_data, x = "Pclass").set(ylabel = None)
sns.histplot(ax = axis[1, 0], data = total_data, x = "Embarked")
sns.histplot(ax = axis[1, 1], data = total_data, x = "SibSp").set(ylabel = None)
sns.histplot(ax = axis[1, 2], data = total_data, x = "Parch").set(ylabel = None)

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

A histogram is a graphical representation of the distribution of a data set. It is also used to understand the frequency of the data. By looking at a histogram, we can understand if the data is skewed towards one extreme, if it is symmetrical, if it has many outliers, and so on. With the representation of each variable we can determine that:

  • Survived: The number of people who did not survive outnumber those who did by more than 300.
  • Sex: There were almost twice as many men as women on the Titanic.
  • Pclass: The sum of passengers traveling in first and second class was almost identical to those traveling in third.
  • Embarked: The majority of Titanic passengers embarked at Southampton (S) station.
  • SibSp: More than 800 passengers traveled without siblings or a spouse aboard. The rest traveled with at least one of them.
  • Parch: Almost all passengers traveled without parents or children; only a small portion traveled with them.
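The bar heights behind these conclusions can be cross-checked numerically with value_counts(). A minimal sketch on hypothetical toy data (on the real data: total_data["Survived"].value_counts(), and likewise for the other columns):

```python
import pandas as pd

# Toy stand-in for total_data["Survived"] (hypothetical values)
survived = pd.Series([0, 0, 0, 1, 1], name = "Survived")

# Exact counts behind the histogram bars
counts = survived.value_counts()
print(counts)
```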

Analysis on numeric variables

A numerical variable is a type of variable that can take numeric values (integers, fractions, decimals, negatives, etc.) within an infinite range. A numerically coded categorical variable can also be treated as numerical (e.g., for further analysis we can treat the class Survived as numerical to study relationships). Numerical variables are usually represented with a histogram and a boxplot, displayed together. Before plotting we must again identify which variables are numerical: Fare and Age (PassengerId would also qualify, but it is meaningless for the analysis and has already been removed), so we plot these two:

In [9]:
fig, axis = plt.subplots(2, 2, figsize = (10, 7), gridspec_kw={'height_ratios': [6, 1]})

# Creating a multiple figure with histograms and box plots
sns.histplot(ax = axis[0, 0], data = total_data, x = "Fare").set(xlabel = None)
sns.boxplot(ax = axis[1, 0], data = total_data, x = "Fare")
sns.histplot(ax = axis[0, 1], data = total_data, x = "Age").set(xlabel = None, ylabel = None)
sns.boxplot(ax = axis[1, 1], data = total_data, x = "Age")

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

The combination of the two previous plots lets us see the distribution and its statistical characteristics. From the resulting visualization we can see that both variables have outliers far from the bulk of the distribution, and that both distributions are asymmetric: Fare is strongly right-skewed (a long tail of high values, with the mean above the median and the mode), while Age shows a milder skew in the same direction.
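The asymmetry read off the plots can be quantified with pandas' skew(): a clearly positive value indicates a long right tail, while a value near zero indicates rough symmetry. A sketch on synthetic data shaped loosely like Fare and Age (the real calls would be total_data["Fare"].skew() and total_data["Age"].skew()):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins: a long right tail (like Fare) and a roughly symmetric one (like Age)
fare_like = pd.Series(rng.exponential(scale = 33.0, size = 1000))
age_like = pd.Series(rng.normal(loc = 30.0, scale = 14.0, size = 1000))

print(round(fare_like.skew(), 2), round(age_like.skew(), 2))
```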

Step 4: Analysis of multivariate variables

After analyzing the characteristics one by one, it is time to analyze them in relation to the predictor and to themselves, in order to draw clearer conclusions about their relationships and to be able to make decisions about their processing.

Thus, if we wanted to eliminate a variable due to a high number of null values or certain outliers, we would first need to apply this analysis to ensure the elimination is not critical for predicting the survival of a passenger. For example, the variable Cabin has many null values, and before eliminating it we would have to check that there is no relationship between it and survival, since it could be very significant for the model, and its removal could bias the prediction.
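One way to perform such a check, sketched here on hypothetical toy data, is to compare the survival rate of passengers with and without a recorded value in the column:

```python
import pandas as pd
import numpy as np

# Toy stand-in: does having a recorded cabin relate to survival? (hypothetical values)
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "C123", np.nan, np.nan],
    "Survived": [1, 0, 0, 1, 1, 0]
})

# Survival rate for passengers with (False) and without (True) a missing cabin
rate = df["Survived"].groupby(df["Cabin"].isnull()).mean()
print(rate)
```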

Numerical-numerical analysis

When the two variables being compared have numerical data, the analysis is said to be numerical-numerical. Scatterplots and correlation analysis are used to compare two numerical columns.

Survived - (Fare, Age)

We will start the bivariate analysis with the variable Survived: being a categorical variable coded in numbers, it can also be treated as numerical. We first analyze the class against the numerical characteristics:

In [10]:
fig, axis = plt.subplots(2, 2, figsize = (10, 7))

# Create a multiple scatter diagram
sns.regplot(ax = axis[0, 0], data = total_data, x = "Fare", y = "Survived")
sns.heatmap(total_data[["Survived", "Fare"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 0], cbar = False)
sns.regplot(ax = axis[0, 1], data = total_data, x = "Age", y = "Survived").set(ylabel=None)
sns.heatmap(total_data[["Survived", "Age"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 1])

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

There is a direct (although not very strong) relationship between ticket price (Fare) and passenger survival: passengers who paid a lower fare were less likely to survive than those who purchased a more expensive ticket. There is also a negative linear relationship, weaker than the previous one, between age (Age) and the target variable. This makes sense considering that children were among the groups given priority in boarding the lifeboats.

In summary, although there is some relationship between these characteristics and the target, their significance is not very high; they are not decisive factors in whether a passenger survived.

Fare - Age

Next we can also relate both variables to determine their degree of affinity or correlation:

In [11]:
fig, axis = plt.subplots(2, 1, figsize = (5, 7))

# Create a multiple scatter diagram
sns.regplot(ax = axis[0], data = total_data, x = "Age", y = "Fare")
sns.heatmap(total_data[["Fare", "Age"]].corr(), annot = True, fmt = ".2f", ax = axis[1])

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

It can be determined that there is no strong relationship between the two variables, and that age has little impact on the price of the ticket.

Categorical-categorical analysis

When the two variables being compared have categorical data, the analysis is said to be categorical-categorical. Histograms and their combinations are used to compare two categorical columns.

Survived - (Sex, Pclass, Embarked, SibSp, Parch)

First we analyze the class against the categorical features, one by one. Here there will be no combinations of several predictors and the class:

In [12]:
fig, axis = plt.subplots(2, 3, figsize = (15, 7))

sns.countplot(ax = axis[0, 0], data = total_data, x = "Sex", hue = "Survived")
sns.countplot(ax = axis[0, 1], data = total_data, x = "Pclass", hue = "Survived").set(ylabel = None)
sns.countplot(ax = axis[0, 2], data = total_data, x = "Embarked", hue = "Survived").set(ylabel = None)
sns.countplot(ax = axis[1, 0], data = total_data, x = "SibSp", hue = "Survived")
sns.countplot(ax = axis[1, 1], data = total_data, x = "Parch", hue = "Survived").set(ylabel = None)

plt.tight_layout()
fig.delaxes(axis[1, 2])

plt.show()

The following conclusions can be drawn from the above graph:

  • A higher proportion of women survived than men. This is because women had priority over men in the evacuation plans.
  • People who traveled alone had a harder time surviving than those who traveled accompanied.
  • Those who traveled in a better class on the Titanic had a higher chance of survival.
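These proportions can be made explicit with a normalized crosstab; a minimal sketch on hypothetical toy data (on the real data: pd.crosstab(total_data["Sex"], total_data["Survived"], normalize = "index")):

```python
import pandas as pd

# Toy stand-in for total_data (hypothetical values)
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0]
})

# Survival rate within each sex (each row sums to 1)
rates = pd.crosstab(df["Sex"], df["Survived"], normalize = "index")
print(rates)
```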

Combinations of the class with several predictors

Multivariate analysis also allows us to combine the class with several predictors at once to enrich the study. These combinations should be chosen judiciously, relating characteristics that plausibly interact. For example, it would not make sense to analyze the passenger's sex against the station where they boarded the Titanic, since there is no relationship between the two. However, the sex and class of the passenger versus their survival is an analysis worth studying, among other cases presented below:

In [13]:
fig, axis = plt.subplots(figsize = (10, 5), ncols = 2)

sns.barplot(ax = axis[0], data = total_data, x = "Sex", y = "Survived", hue = "Pclass")
sns.barplot(ax = axis[1], data = total_data, x = "Embarked", y = "Survived", hue = "Pclass").set(ylabel = None)

plt.tight_layout()

plt.show()

From these analyses it is clear that women had a higher chance of survival regardless of the class in which they traveled or the port of embarkation, which reinforces the knowledge obtained earlier. Furthermore, on average, people who traveled in higher classes had a higher survival rate than those who did not.

Correlation analysis

To compute correlations involving categorical variables, we must first transform them into numbers so that the correlation coefficient can be calculated. Here we encode Sex and Embarked with pd.factorize() and then plot the correlation matrix of the categorical predictors together with the class:

In [14]:
total_data["Sex_n"] = pd.factorize(total_data["Sex"])[0]
total_data["Embarked_n"] = pd.factorize(total_data["Embarked"])[0]

fig, axis = plt.subplots(figsize = (10, 6))

sns.heatmap(total_data[["Sex_n", "Pclass", "Embarked_n", "SibSp", "Parch", "Survived"]].corr(), annot = True, fmt = ".2f")

plt.tight_layout()

plt.show()

The correlation analysis shows a strong direct relationship between the sex (Sex) of the passenger and his or her survival, as we saw in previous sections. There is also a relationship between the variables counting a passenger's companions (SibSp and Parch). The remaining correlations are weak and not significant enough to be included in the analysis.

Finally, to close the multivariate study, it remains to analyze the relationship between the categorical and numerical variables.

Numerical-categorical analysis (complete)

This is the most complete analysis we can carry out. To do so, we simply calculate the correlations between the variables, since these are the best indication of their relationships. Once we have verified that a relationship exists, we can study it in more depth. Another element that can be very helpful is obtaining the pairwise relationships between all the variables in the dataset. This is partly redundant, since much of it has already been computed above, and is therefore optional.

In [15]:
fig, axis = plt.subplots(figsize = (10, 7))

sns.heatmap(total_data[["Age", "Fare", "Sex_n", "Pclass", "Embarked_n", "SibSp", "Parch", "Survived"]].corr(), annot = True, fmt = ".2f")

plt.tight_layout()

plt.show()

There is a strong negative relationship between class (Pclass) and passenger age (Age): older passengers tended to travel in better (lower-numbered) classes. There is also a strong negative relationship between class and fare paid (Fare), which makes a lot of sense. The rest of the correlations remain as previously seen.

Having analyzed the correlation, let us analyze the two cases seen to corroborate the theory:

In [16]:
fig, axis = plt.subplots(figsize = (10, 5), ncols = 2)

sns.regplot(ax = axis[0], data = total_data, x = "Age", y = "Pclass")
sns.regplot(ax = axis[1], data = total_data, x = "Fare", y = "Pclass").set(ylabel = None, ylim = (0.9, 3.1))

plt.tight_layout()

plt.show()

In the first plot we see that as age increases, first-class tickets become more common, and as age decreases, third-class tickets do, reinforcing the negative relationship between the two variables. The second plot also reinforces what was observed, as better classes carry more expensive fares.

Once the correlation has been calculated, we can draw the pairplot (this is an optional step):

In [17]:
sns.pairplot(data = total_data)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x7ff830f56f50>

END OF DAY 1

Now, let's work and practice today's lesson to reinforce what we have learned!

DAY 2

Step 5: Feature engineering

Feature engineering is a process that involves the creation of new features (or variables) from existing ones to improve model performance. This may involve a variety of techniques such as normalization, data transformation, and so on. The goal is to improve the accuracy of the model and/or reduce the complexity of the model, thus making it easier to interpret.

In the previous steps we already began working on the data: eliminating duplicates, accounting for null values and, in order to calculate correlations, transforming Sex and Embarked into numerical categories. Although those transformations are part of feature engineering, they are usually done before analyzing the variables, splitting the process into a preliminary stage and the one we will see next.
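As a purely illustrative example of this step (not part of the original notebook), a commonly engineered feature for this dataset combines SibSp and Parch into a single family-size variable:

```python
import pandas as pd

# Toy stand-in for the SibSp / Parch columns (hypothetical values)
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 1, 0],
    "Parch": [0, 0, 0, 2, 0]
})

# Hypothetical engineered features: total family size aboard, and a traveling-alone flag
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```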

Outlier analysis

An outlier is a data point that deviates significantly from the others. It is a value that is noticeably different from what would be expected given the general trend of the data. These outliers may be caused by errors in data collection, natural variations in the data, or they may be indicative of something significant, such as an anomaly or extraordinary event.

Descriptive analysis is a powerful tool for characterizing the data set: the mean, variance and quartiles provide powerful information about each variable. The describe() function of a DataFrame computes all of these values for us in a single call.

In [18]:
total_data.describe()
Out[18]:
|  | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_n | Embarked_n |
|---|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 1309.000000 | 1309.000000 |
| mean | 0.377387 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 | 0.355997 | 0.392666 |
| std | 0.484918 | 0.837836 | 14.413493 | 1.041658 | 0.865560 | 51.758668 | 0.478997 | 0.655586 |
| min | 0.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 |
| 25% | 0.000000 | 2.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 1.000000 | 1.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 1.000000 | 2.000000 |

While experience matters when interpreting the table above, we can use certain rules to detect outliers, such as comparing the minimum and maximum of a feature with its 25% and 75% percentiles. For example, everything looks normal except for the Fare column, which has a mean of 33.30 while its median (50% percentile) is 14.45 and its maximum value is 512.33. The value 512.33 looks like an outlier, but it could also be a transcription error, or the most expensive ticket may genuinely have cost that much. It would be useful to investigate and confirm or disprove that information.
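The quartile comparison described above is often formalized with the interquartile-range (IQR) rule: values above Q3 + 1.5·IQR or below Q1 − 1.5·IQR are flagged as potential outliers. A quick sketch using the Fare quartiles from the table above:

```python
# IQR rule applied to the Fare quartiles read from describe() above
q1, q3 = 7.8958, 31.2750
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(round(lower_bound, 2), round(upper_bound, 2))
# The maximum Fare of 512.33 sits far above the upper bound
print(512.3292 > upper_bound)
```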

Drawing box plots of the variables also gives us very powerful information about outliers that fall outside the confidence regions:

In [19]:
fig, axis = plt.subplots(3, 3, figsize = (15, 10))

sns.boxplot(ax = axis[0, 0], data = total_data, y = "Survived")
sns.boxplot(ax = axis[0, 1], data = total_data, y = "Pclass")
sns.boxplot(ax = axis[0, 2], data = total_data, y = "Age")
sns.boxplot(ax = axis[1, 0], data = total_data, y = "SibSp")
sns.boxplot(ax = axis[1, 1], data = total_data, y = "Parch")
sns.boxplot(ax = axis[1, 2], data = total_data, y = "Fare")
sns.boxplot(ax = axis[2, 0], data = total_data, y = "Sex_n")
sns.boxplot(ax = axis[2, 1], data = total_data, y = "Embarked_n")

plt.tight_layout()

plt.show()