This lesson is a roadmap on how to learn Python for data science. If you already know Python and have at least some experience with data science libraries like Pandas and NumPy, you can skip ahead to the next lesson.
Python is a very versatile language used to build robots, networks, websites, APIs, games, artificial intelligence, and more. Depending on what you are making, the syntax can vary so much that you will hardly recognize similarities between codebases.
Don't get me wrong: no matter what you are building, you will be looping, using conditions and logical expressions, filtering, and working with functions and classes, but you will be doing those things in ways that feel different. Let me give you an example:
## Many ways to filter

If you want to filter a list of numbers, keeping only the odd ones, you can do it like this:
```python
# this is our common list of numbers
numbers = [23, 34, 5, 6, 45, 34, 23, 5, 45, 5, 324, 23, 354, 65, 564, 45, 342]

# Filtering with a list comprehension (basic Python 3)
odds = [n for n in numbers if n % 2 != 0]
print(odds)

# Filtering with filter() and a lambda (basic Python 3)
odds = list(filter(lambda n: n % 2 != 0, numbers))
print(odds)

# Filtering with NumPy's boolean masking
import numpy as np
odds = np.array(numbers)
print(odds[odds % 2 != 0])
```
In the example above, the three approaches filter and output the same list of odd numbers, but in the background (at the lowest machine level), things execute in very different ways.
For this and many other reasons, it is convenient to start learning the data science style of Python syntax from the beginning.
In the data science world of Python, there is a family of libraries at the top of the food chain: NumPy, Pandas, Seaborn, Matplotlib, scikit-learn, etc. We will be learning each of them over the next couple of weeks while building real-life projects.
We will be using A LOT of 3rd party code and functions (from other people), so it's a good idea to learn about a Python feature called help() that will tell you how to use almost anything.
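For example, here is a minimal sketch of calling help() on a built-in function and on a third-party one (assuming NumPy is installed):

```python
import numpy as np

# Print the documentation of Python's built-in sorted()
help(sorted)

# It works for 3rd party code too: print the docs for NumPy's mean()
help(np.mean)
```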
If you don't know how to code, it's better to forget about data science for a second and start learning Python from scratch: what variables are and the basic types they can hold: integer, float, string, None (Python's null), and boolean.
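As a quick refresher, here is a minimal sketch of those basic types (the variable names are just examples):

```python
age = 27            # integer
price = 19.99       # float
name = "Rigoberto"  # string
nothing = None      # Python's "null" value
is_active = True    # boolean

# type() tells you what kind of value a variable holds
print(type(age), type(price), type(name), type(nothing), type(is_active))
```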
Since we are focused on using Python for data science, we will be working with massive amounts of data, so I strongly recommend focusing your training on exercises and projects about the following:
Start with NumPy: because it's the foundation; all the other libraries either work on top of NumPy or are compatible with it. In addition, NumPy comes with the array and a series of functions that save you lots of time and processing power for the typical operations involved in dealing with big chunks of data.
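Here is a minimal sketch of that idea, with made-up sample data: one vectorized NumPy operation replaces an explicit Python loop.

```python
import numpy as np

grades = np.array([80, 92, 67, 88, 75])

# Vectorized math: add 5 points to every grade, no loop needed
curved = grades + 5
print(curved)

# Built-in aggregations save time and processing power
print(grades.mean(), grades.max(), grades.min())
```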
Continue with Pandas and Seaborn: built on top of NumPy, the Pandas library incorporates the DataFrame object, simplifying the import, export, and transformation of multidimensional datasets.
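For example, a minimal DataFrame sketch (the data is made up; in a real project you would more likely load a dataset from a file with pd.read_csv):

```python
import pandas as pd

# Build a small DataFrame by hand; normally you would do
# something like: df = pd.read_csv("your_file.csv")
df = pd.DataFrame({
    "name": ["Ana", "Bob", "Cai"],
    "age": [23, 35, 31],
})

# Transformations are one-liners
print(df[df["age"] > 25])    # filter rows
print(df["age"].describe())  # quick summary statistics
```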
Chart your DataFrame with Matplotlib: this library is responsible for the majority of the data visualizations you see in the world of data science. From a simple bar chart to histograms, it lets you turn arrays and data frames into data visualizations.
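A minimal sketch, using randomly generated data just for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# 1000 random values drawn from a normal distribution
data = np.random.normal(loc=50, scale=10, size=1000)

# A histogram takes three lines
plt.hist(data, bins=20)
plt.title("Distribution of 1000 random values")
plt.show()
```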
Use statistics functions with stats: this library includes the most common statistics functions like mean, std, variance, correlation, etc.
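Python's built-in statistics module covers several of these (scipy.stats offers more advanced tools); a minimal sketch:

```python
import statistics as stats

data = [23, 34, 5, 6, 45, 34, 23, 5, 45]

print(stats.mean(data))      # average
print(stats.stdev(data))     # sample standard deviation
print(stats.variance(data))  # sample variance
```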
Create models with scikit-learn: simple and efficient tools for predictive data analysis, built on NumPy, SciPy, and Matplotlib.
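As a taste of what's coming, here is a minimal sketch with toy, made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: hours studied vs. exam score
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 60, 71, 80, 88])

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict the score for 6 hours of study
print(model.predict([[6]]))
```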
When creating algorithms for websites, APIs, networks, robots, or other applications, you don't necessarily worry about performance because you can always upgrade your server or your CPU. Obviously, performance is still an important variable, but you can live with a slow website, or you can pay a lot more to "speed it up".
On the other hand, when building AIs, performance is a big concern. You could say that "processing power" is the biggest limitation this field has right now. There is so much data, and models need so much training, that current technology cannot keep up.
That is why you should know a little about Big O and optimizing algorithms for space and time.
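A tiny illustration of why Big O matters, using only the standard library: the same membership check is O(n) on a list but O(1) on a set, and the difference grows with the size of the data.

```python
import time

n = 1_000_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case: -1 is not there, so the list scans every element
start = time.perf_counter()
-1 in as_list   # O(n)
print("list:", time.perf_counter() - start)

# The set answers with a single hash lookup
start = time.perf_counter()
-1 in as_set    # O(1)
print("set:", time.perf_counter() - start)
```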
Most, if not all, of the Python courses about machine learning heavily use Jupyter Notebooks to build and clean your data and run the models. At 4Geeks we don't like this approach much; we prefer to keep working on .py files and use coding editors like VSCode, PyCharm, etc., like software engineers do in all the other fields where Python is used.
We like using Jupyter as a communication tool, mainly to tell stories to management, present the strategy, and do some brainstorming.
Since we are not going to be using Jupyter Notebooks as much as other data scientists, we need to become very good at setting up Python environments, downloading Python packages, using the pip package manager, etc.
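A minimal sketch of that workflow on macOS/Linux (the venv folder name is just a convention; the Windows activation command differs, as noted):

```bash
# Create an isolated environment for the project
python -m venv venv

# Activate it (on Windows: venv\Scripts\activate)
source venv/bin/activate

# Install the data science stack with pip
pip install numpy pandas matplotlib seaborn scikit-learn

# Record the exact versions so others can reproduce the environment
pip freeze > requirements.txt
```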
We will be using a template for our projects called Cookiecutter Data Science. Using a template is always a good idea to organize your project files and workflow. You can read the documentation and watch this video on how it works.
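To give you an idea, here is an abridged sketch of the kind of folder layout the template generates (check the documentation for the full, current structure):

```
├── README.md          <- top-level description of the project
├── data
│   ├── raw            <- the original, immutable data
│   └── processed      <- the final datasets for modeling
├── notebooks          <- Jupyter notebooks, mainly for communication
├── reports            <- generated analysis, figures, etc.
├── requirements.txt   <- pip dependencies to reproduce the environment
└── src                <- source code (.py files) for the project
```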