“Hard work always beats talent when talent doesn't work hard” - Tim Notke
Once you have finished the project, you must deliver:
To carry out the final project, students will be organized in teams of 2 to 3 people. The work must be collaborative.
Start by defining a problem and turn it into a Machine Learning problem. This is the first step, since the data must meet a certain need and the Machine Learning process must aim at satisfying that need.
The choice of the data set must satisfy minimum requirements in terms of number of rows and predictor variables. At a minimum, it must contain:
NOTE: Depending on the dataset and the case study to be explored, datasets that do not reach the established minimum may be evaluated and accepted.
Since in the real world data does not usually arrive in a flat csv file, this data must be acquired by one of the following ways:
Once you have the data, you must store it in a CSV document and load it into Python using Pandas.
NOTE: Depending on the dataset and the case study to be explored, datasets downloaded by other means could be evaluated and accepted.
A widely used practice is to store the data, especially if they are massive, in a database for quick access to them. From all the databases we have studied, choose the one most compatible with your data and store it there. Then, perform queries using Python (with pure SQL code or using the wrappers we have studied in the course) to use the different statements: SELECT
, JOIN
, INSERT
.... These queries must provide a value to start the analysis on the data prior to the statistics and EDA.
It is important to understand that in the real world we do not only have CSV as an ally to store data, since it is easier to lose a flat file like CSV than a database with its connections and data models inside. Security is also a critical and important factor for storing your data there, since a CSV does not provide any protection mechanism that other technologies do.
The raw data stored in a database can be a great and very valuable source of information. Before we begin to simplify and exploit them with EDA, we must know their fundamental statistical measures: means, modes, distributions, deviations, etcetera. Analyze the descriptive statistical variables of each of the predictors of the data set and theorize about the distribution that each of them follows.
Use hypothesis tests if you consider it necessary.
This step is vital to ensure that we keep the variables that are strictly necessary and eliminate those that are not relevant or do not provide information. Use the example Notebook we worked on and adapt it to this use case.
Make sure to conveniently divide the data set into train
and test
as we have seen in previous lessons.
Once you have your data ready, decide which model fits best and train it. If in doubt, try using several of the ones you have already studied. Select the one that best fits the data.
Remember that the hyperparameter optimization step is very important to explore and achieve the best version of the model.
Create a Machine Learning web application using your saved model. You can use Flask, Streamlit or any other tool you know. Use Heroku, Render or another cloud computing platform of your choice to deploy your web application and share it with the world. Remember that the application is going to be the gateway to potential users or customers, and you have to take care of even the smallest detail.
The presentation will last 5 minutes per group, so be sure to use your time efficiently. The code will be reviewed, so don't waste time explaining it. Focus on the important points, as if you were trying to sell the project to your company's stakeholders or investors. Keep in mind that they probably don't have a technical background, so try to use simple words and an easy-to-understand presentation. Remember that quality trumps quantity.
Recommended important points to mention in your 5-minute presentation:
“The secret to getting ahead is to start.” - Mark Twain