Difficulty
easy
Average duration
2 hrs
Technologies
Data Science
Python
Machine Learning
linear regression
python pandas
Once you have finished solving the exercises, be sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.
Sociodemographic and health resource data have been collected by county in the United States and we want to find out if there is any relationship between health resources and sociodemographic data.
To do this, you need to set a target variable (health-related) to conduct the analysis.
The dataset can be found in this project folder under the name demographic_health_data.csv. You can load it into the code directly from the link:
https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv
Alternatively, download it and add it to your repository by hand. The dataset contains a large number of variables, which you will find defined here.
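As a minimal sketch of the loading step, assuming pandas is installed, the CSV can be read straight from the link or from a local copy. The helper name `load_dataset` and the tiny in-memory CSV used for the offline check are illustrative and not part of the project.

```python
import io

import pandas as pd

# URL given in the project description.
DATA_URL = (
    "https://raw.githubusercontent.com/4GeeksAcademy/"
    "regularized-linear-regression-project-tutorial/main/"
    "demographic_health_data.csv"
)


def load_dataset(source=DATA_URL):
    """Read the CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Offline check with a made-up two-column CSV (not the real data).
sample = load_dataset(io.StringIO("col_a,col_b\n1,0.5\n2,0.7\n"))
print(sample.shape)

# Typical use (needs network access):
#   total_data = load_dataset()
#   total_data.head()
```

If you downloaded the file by hand, pass its local path instead, for example `load_dataset("demographic_health_data.csv")`.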
This second step is vital: it ensures that we keep only the variables that are strictly necessary and eliminate those that are irrelevant or uninformative. Use the example Notebook we worked on and adapt it to this use case.
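One possible way to eliminate uninformative variables is sketched here. It drops duplicate rows, constant columns, and one column from each pair of highly correlated numeric columns. The function name, the 0.95 threshold, and the demo column names are assumptions made for illustration. The example Notebook may use a different procedure, and in practice you would keep the target variable out of the correlation filter.

```python
import pandas as pd


def drop_uninformative_columns(df, corr_threshold=0.95):
    """Drop duplicate rows, constant columns, and near-duplicate numeric columns."""
    df = df.drop_duplicates()

    # A column with a single distinct value carries no information.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)

    # For each highly correlated pair, keep the first column and drop the second.
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(cols):
        if first in to_drop:
            continue
        for second in cols[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > corr_threshold:
                to_drop.add(second)
    return df.drop(columns=sorted(to_drop))


# Made-up data: one redundant column and one constant column.
demo = pd.DataFrame({
    "population": [100, 250, 400, 800],
    "population_copy": [200, 500, 800, 1600],
    "constant": [1, 1, 1, 1],
    "rate": [3.0, 1.0, 4.0, 2.0],
})
cleaned = drop_uninformative_columns(demo)
print(list(cleaned.columns))
```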
Be sure to split the dataset appropriately into train and test sets, as we have seen in previous lessons.
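A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The synthetic frame, the column names, the 80/20 ratio, and the seed are placeholders for your cleaned data and your own choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (100 rows, 2 features, 1 target).
rng = np.random.default_rng(42)
data = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["feature_1", "feature_2", "target"],
)

X = data.drop(columns="target")
y = data["target"]

# Hold out 20% of the rows for testing; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Any scaling or other preprocessing should be fitted on the training set only and then applied to the test set, so that no information from the test rows leaks into training.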
Start solving the problem by implementing a linear regression model and analyze the results. Then, using the same data and default attributes, build a Lasso model and compare the results with the baseline linear regression.
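The comparison could look roughly like the following sketch, which assumes scikit-learn and uses synthetic data from `make_regression` in place of the county dataset. The metrics (MSE, R², and the number of non-zero coefficients) are one reasonable choice, not a prescribed one.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 features, of which only 5 influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline linear regression and a Lasso with default attributes (alpha=1.0).
models = {"linear": LinearRegression(), "lasso": Lasso()}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
        # Lasso can shrink coefficients to exactly zero.
        "nonzero_coefs": int((model.coef_ != 0).sum()),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Besides the error metrics, the count of non-zero coefficients is worth checking, because it shows how many variables the Lasso model has effectively discarded.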
Analyze how the model's score (for example, R²) evolves when the hyperparameter of the Lasso model changes (you can, for example, start testing from a value of 0.0 and work your way up to a value of 20). Plot these values as a line chart.
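A sketch of such a sweep, assuming scikit-learn, where the Lasso hyperparameter is called `alpha`, and using R² on the test set as the tracked score. The sweep starts at 0.1 rather than 0.0 because scikit-learn advises against `alpha=0` for `Lasso` (that case is ordinary least squares and is better handled by `LinearRegression`). The function names and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split


def r2_by_alpha(X_train, y_train, X_test, y_test, alphas):
    """Fit one Lasso model per alpha and return the test R² of each."""
    scores = []
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10_000)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))  # score() returns R²
    return scores


def plot_scores(alphas, scores):
    """Draw the scores as a line chart (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.plot(alphas, scores, marker="o")
    plt.xlabel("alpha")
    plt.ylabel("R² on the test set")
    plt.title("Lasso: R² as alpha changes")
    plt.show()


# Synthetic stand-in for the project data.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = np.linspace(0.1, 20.0, 20)
scores = r2_by_alpha(X_train, y_train, X_test, y_test, alphas)
```

Calling `plot_scores(alphas, scores)` afterwards draws the line chart.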
After training the Lasso model, if the results are not satisfactory, optimize it using one of the techniques seen above.
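The text does not say which optimization technique to use, so the following is only one option: a cross-validated grid search over `alpha` with scikit-learn's `GridSearchCV`. The grid values, the five folds, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the project data.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas on a logarithmic scale from 0.001 to 10.
param_grid = {"alpha": np.logspace(-3, 1, 9)}

# 5-fold cross-validation on the training set only, scored by R².
search = GridSearchCV(Lasso(max_iter=10_000), param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

best_alpha = float(search.best_params_["alpha"])
test_r2 = search.score(X_test, y_test)  # R² of the refitted best model
print(best_alpha, test_r2)
```

The test set is used only once, at the end, so the reported score is not biased by the search itself.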
Note: We have also included a sample solution in ./solution.ipynb, which we strongly suggest you use only if you are stuck for more than 30 minutes or if you have already finished and want to compare it with your approach.