Self-paced

Explore our extensive collection of courses designed to help you master various subjects and skills. Whether you're a beginner or an advanced learner, there's something here for everyone.

Bootcamp

Learn live

Join us for our free workshops, webinars, and other events to learn more about our programs and get started on your journey to becoming a developer.

Upcoming live events

Learning library

For all the self-taught geeks out there, here is our content library with most of the learning materials we have produced throughout the years.

It makes sense to start learning by reading and watching videos about fundamentals and how things work.

Search from all Lessons


LoginGet Started
← Back to Lessons
Open in Colab

Exploring Linear Regression

Linear Regression in Python

Next we will see how we can implement this model in Python. To do this, we will use the scikit-learn library.

Simple linear regression

To exemplify the implementation of a simple linear regression model, we will use a dataset with a few instances that has been previously treated with a full EDA.

Step 1. Reading the processed data set

In [1]:
import pandas as pd


train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_salary_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_salary_test.csv")

train_data.head()
Out[1]:
YearsExperienceSalary
01.139343.0
11.346205.0
22.043525.0
32.239891.0
42.956642.0

As the exploratory analysis process has not been shown in this notebook, the relationship between the predictor variable and the target variable (this is seen in the univariate analysis) will be visualized below using a dot plot:

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axis = plt.subplots(2, 1, figsize = (5, 7))
total_data = pd.concat([train_data, test_data])

sns.regplot(ax = axis[0], data = total_data, x = "YearsExperience", y = "Salary")
sns.heatmap(total_data[["Salary", "YearsExperience"]].corr(), annot = True, fmt = ".2f", ax = axis[1], cbar = False)

plt.tight_layout()

plt.show()
No description has been provided for this image

There is a clear linear relationship between the predictor variable and the target variable, so it can be easily modeled by this type of model. If the correlation were lower, the model would not have good accuracy.

The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model. We will also split the predictors of the features.

Moreover, since there is only one predictor variable, it is not necessary to apply a normalization. If there were several, it would have to be applied.

In [3]:
X_train = train_data.drop(["Salary"], axis = 1)
y_train = train_data["Salary"]
X_test = test_data.drop(["Salary"], axis = 1)
y_test = test_data["Salary"]

Step 2: Initialization and training of the model

In [4]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
Out[4]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

After the training process, we can know the parameters (variables aa and bb) that the model has fitted:

In [5]:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients (b): {model.coef_}")
Intercept (a): 26354.43069701219
Coefficients (b): [9277.78307971]

In this case, there is only one coefficient since the linear regression is simple.

Step 3: Model prediction

In [6]:
y_pred = model.predict(X_test)
y_pred
Out[6]:
array([ 40271.10531658,  54187.77993614,  68104.45455571,  89443.35563904,
       102432.25195063, 121915.59641802])

To compare the predicted value of the original, we can easily perform a comparative plot as follows:

In [7]:
fig, axis = plt.subplots(1, 2, figsize = (5, 3.5))
total_data = pd.concat([train_data, test_data])

# We use the parameters adjusted in the training to draw the regression line in the plots
regression_equation = lambda x: 26354.43069701219 + 9277.78307971 * x

sns.scatterplot(ax = axis[0], data = test_data, x = "YearsExperience", y = "Salary")
sns.lineplot(ax = axis[0], x = test_data["YearsExperience"], y = regression_equation(test_data["YearsExperience"]))
sns.scatterplot(ax = axis[1], x = test_data["YearsExperience"], y = y_pred)
sns.lineplot(ax = axis[1], x = test_data["YearsExperience"], y = regression_equation(test_data["YearsExperience"])).set(ylabel = None)

plt.tight_layout()

plt.show()
No description has been provided for this image

As we can see, the test predicted by the model will always fit the regression equation, since it is the one learned by the model. The figure on the left represents the actual values, while those on the right are the predicted ones. We see that some predicted values match with the actual values, and those that do not have a noticeable difference. We will see next the value of the metric to learn more about the performance of the algorithm.

To calculate the effectiveness of the model we will use the mean squared error (MSE) and the coefficient of determination (R2R^2), one of the most popular metrics:

In [8]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
Mean squared error: 37649779.451336615
Coefficient of determination: 0.959714925174946

The lower the RMSE value, the better the model. A perfect model (a hypothetical model that can always predict the exact expected value) would have a value for this metric of 0. We observe that there is a slippage of 37 million, so we could understand that it is very bad. If we rely on the R2R^2 value, we observe that it is 95%, a very high value, and then 95% of the data is explained by the model, so it is satisfactory.

Step 4: Optimization of results

This type of model cannot be optimized due to the absence of hyperparameters.

Multiple linear regression

To exemplify the implementation of a simple multiple regression model, we will use a data set with a few instances that has been previously treated with a full EDA.

Step 1. Reading the processed data set

In [9]:
import pandas as pd
import matplotlib.pyplot as plt 

train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_weight-height_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_weight-height_test.csv")

train_data.head()
Out[9]:
GenderHeightWeight
0-1.0-0.575639151.275533
1-1.0-0.992843123.965162
2-1.0-0.925964124.765438
3-1.0-1.478210119.195698
41.0-1.598649146.956646

For this problem, we want to calculate the weight (weight) as a function of the height (height) and gender (gender) of the person. Therefore, weight will be the dependent variable (target variable), and height and gender will be the independent variables (predictor variables). Since this is a continuous numerical prediction, we have to solve this with a multiple logistic regression model.

As the exploratory analysis process has not been shown in this notebook, the relationship between the predictor variable and the target variables (this is seen in the univariate analysis) will be visualized below using a dot plot:

In [10]:
fig, axis = plt.subplots(2, 2, figsize = (10, 7))
total_data = pd.concat([train_data, test_data])

sns.regplot(ax = axis[0, 0], data = total_data, x = "Gender", y = "Weight")
sns.heatmap(total_data[["Weight", "Gender"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 0], cbar = False)
sns.regplot(ax = axis[0, 1], data = total_data, x = "Height", y = "Weight")
sns.heatmap(total_data[["Weight", "Height"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 1], cbar = False)

plt.tight_layout()

plt.show()
No description has been provided for this image

There is a clear linear relationship between the predictor variable and the target variables, so it can be easily modeled by this type of model. If the correlation were lower, the model would not have good accuracy.

In [11]:
X_train = train_data.drop(["Weight"], axis = 1)
y_train = train_data["Weight"]
X_test = test_data.drop(["Weight"], axis = 1)
y_test = test_data["Weight"]

Step 2: Initialization and training of the model

In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
Out[12]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

After the training process, we can know the parameters (variables aa and b1,b2b_1, b_2) that the model has fitted:

In [13]:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients (b1, b2): {model.coef_}")
Intercept (a): 161.48606316160345
Coefficients (b1, b2): [ 9.65020608 22.88377295]

Step 3: Model prediction

In [14]:
y_pred = model.predict(X_test)
y_pred
Out[14]:
array([105.17851056, 188.29501423, 137.05824216, ..., 112.17172027,
       130.89667195, 137.46475059])
In [15]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
Mean squared error: 98.21235363443171
Coefficient of determination: 0.9075866115171992

If we rely on the value of R2R^2, we observe that it is 90%, a very high value, and then 90% of the data is explained by the model, so it is satisfactory.

Step 4: Optimization of results

This type of model cannot be optimized due to the absence of hyperparameters.