
How to take the Exploratory Data Analysis module

How to read this module

In this module we will work together on the Titanic passengers dataset. We will work through a single large notebook that spans the three days of this module. It covers exploratory data analysis as well as the data preprocessing techniques needed to get our dataframe ready for modeling.

Please use the 'exploratory-data-analysis' notebook as your main notebook to learn about EDA and the data cleaning process. Read it step by step and execute the code cells for each process. When prompted, click each reference link to read about the different methods for that process. Make sure you read the reference link for each part of the process before trying to understand its code.

Every day, after reading and executing the Titanic guiding notebook, go to the module's project to put your new skills into practice. Read the project instructions and make some progress every day.

Structure of the Titanic guiding notebook

The Titanic guiding notebook is divided into 3 days.

Day 1: Exploratory data analysis on the Titanic dataset

Understand the problem

Loading the data

Visualizations and finding relationships

Execute code cells for each part of the Titanic EDA

Day 2: Feature Engineering

Read the feature engineering lecture and come back to the main notebook to start the process step by step.

For each of the following steps, read the introduction, then follow the link to read about the different methods, and then come back to your main notebook to understand and execute the code for that part of the Titanic cleaning process:

  • Outliers

  • Missing values

  • Feature encoding for categorical variables

  • Feature scaling

  • Feature selection
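The cleaning steps above can be sketched with pandas on a tiny, hypothetical Titanic-like DataFrame (the column names mirror the dataset, but the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-DataFrame with Titanic-like columns
df = pd.DataFrame({
    "age":      [22.0, 38.0, np.nan, 35.0, 80.0],
    "fare":     [7.25, 71.28, 8.05, 53.10, 512.33],
    "embarked": ["S", "C", "S", None, "Q"],
})

# Outliers: flag fares outside 1.5 * IQR of the interquartile range
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)

# Missing values: impute numeric with the median, categorical with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

# Feature encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["embarked"])

# Feature scaling: min-max scale fare to the [0, 1] range
df["fare"] = (df["fare"] - df["fare"].min()) / (df["fare"].max() - df["fare"].min())
```

Each of these steps has several alternative methods (covered in the linked lectures); this is just one reasonable choice per step.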

Day 3: Feature Selection

Read the introduction, then follow the link to read the different methods and then come back to your main notebook to understand and execute the code for that part of the cleaning process.
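As a rough sketch of what feature selection can look like with scikit-learn (the data below is hypothetical, not the real Titanic dataset), `SelectKBest` keeps the features most associated with the target:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative numeric features and a binary survival target
X = pd.DataFrame({
    "pclass": [3, 1, 2, 3, 1, 2, 3, 1],
    "sibsp":  [1, 1, 0, 0, 0, 0, 4, 2],
    "parch":  [0, 0, 0, 0, 0, 0, 1, 0],
})
y = [0, 1, 1, 0, 1, 1, 0, 1]

# Keep the 2 features with the strongest chi-squared association to the target
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
selected = X.columns[selector.get_support()]
```

Note that `chi2` requires non-negative feature values; other score functions (e.g. `f_classif`) work for general numeric data.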

Use this day to understand this final part of the data cleaning process, and especially to finish working on your project.

Remember the phrase 'Garbage in, garbage out'. If your model receives poor-quality data, it will produce poor results.

Introducing new libraries

In this module we will also introduce two new libraries, Seaborn and scikit-learn, which will make the job much easier.

You will also find a very complete data science cheatsheet covering several libraries and their most commonly used commands.

Seaborn

Seaborn is a Python library for making statistical graphics. It builds on top of Matplotlib and integrates closely with pandas data structures, providing a high-level interface for drawing attractive and informative statistical plots.

We normally import the Seaborn library like this:

import seaborn as sns

If you run into a problem because Seaborn is not installed, just type the following in the terminal:

pip install seaborn

In the Titanic guiding notebook we show some examples using Seaborn, but you can check Seaborn's website for more information:

https://seaborn.pydata.org/
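As a minimal sketch of a typical Seaborn plot (using made-up Titanic-like data, and Matplotlib's non-interactive Agg backend so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small sample resembling Titanic columns (values invented for illustration)
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass":   [3, 1, 2, 3, 1, 2],
    "age":      [22, 38, 26, 35, 54, 28],
})

# Count of passengers per class, split by survival
ax = sns.countplot(data=df, x="pclass", hue="survived")
ax.set_title("Passengers per class by survival")
plt.savefig("pclass_survival.png")
```

In the guiding notebook you would run this on the real Titanic DataFrame instead of this toy sample.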

Scikit-learn

Scikit-learn is perhaps Python's most useful machine learning library. Regression, dimensionality reduction, classification, and clustering are only a few of the methods it offers for statistical modeling and building ML models. It is also very useful for the data transformation steps, for example encoding and scaling.

In the Titanic guiding notebook we mention several scikit-learn commands, but you can use the scikit-learn documentation to find more commands and better understand their parameters:

https://scikit-learn.org/stable/data_transforms.html
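For example, encoding and scaling can be sketched like this with scikit-learn's preprocessing module (the data below is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical Titanic-like sample
df = pd.DataFrame({
    "sex":  ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# One-hot encode the categorical column (sparse result converted to an array)
enc = OneHotEncoder()
sex_encoded = enc.fit_transform(df[["sex"]]).toarray()

# Standardize the numeric column to zero mean and unit variance
scaler = StandardScaler()
fare_scaled = scaler.fit_transform(df[["fare"]])
```

Fitting the transformer (`fit_transform`) on training data and reusing it (`transform`) on new data is the usual scikit-learn pattern, which the guiding notebook follows as well.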


©4Geeks Academy LLC 2019

Privacy policies


Cookies policies


Terms & Conditions