Next we will see how we can implement this model in Python. To do this, we will use the scikit-learn
library.
To exemplify the implementation of a simple linear regression model, we will use a dataset with a few instances that has been previously treated with a full EDA.
import pandas as pd
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_salary_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_salary_test.csv")
train_data.head()
As the exploratory analysis process is not shown in this notebook, the relationship between the predictor variable and the target variable (normally examined in the bivariate analysis) will be visualized below using a scatter plot with a fitted regression line and a correlation heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
fig, axis = plt.subplots(2, 1, figsize = (5, 7))
total_data = pd.concat([train_data, test_data])
sns.regplot(ax = axis[0], data = total_data, x = "YearsExperience", y = "Salary")
sns.heatmap(total_data[["Salary", "YearsExperience"]].corr(), annot = True, fmt = ".2f", ax = axis[1], cbar = False)
plt.tight_layout()
plt.show()
There is a clear linear relationship between the predictor variable and the target variable, so it can be modeled well by this type of model. If the correlation were lower, the model would not achieve good predictive performance.
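As a quick numerical check (this is the same value shown in the heatmap above), the Pearson correlation between the two variables can also be computed directly with pandas:
# Pearson correlation between predictor and target (same value as in the heatmap)
total_data["YearsExperience"].corr(total_data["Salary"])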
The train set will be used to train the model, while the test set will be used to evaluate its effectiveness. We will also separate the predictor variables (features) from the target variable.
Moreover, since there is only one predictor variable, it is not necessary to apply normalization. If there were several predictors on different scales, we would have to apply it (a sketch of how this could be done is shown right after the split below).
X_train = train_data.drop(["Salary"], axis = 1)
y_train = train_data["Salary"]
X_test = test_data.drop(["Salary"], axis = 1)
y_test = test_data["Salary"]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
After the training process, we can inspect the parameters (variables $a$ and $b$) that the model has fitted:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients (b): {model.coef_}")
In this case, there is only one coefficient since the linear regression is simple.
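As an illustrative check (a hypothetical snippet, not part of the original notebook), we can evaluate the fitted equation salary = a + b · years manually with these attributes and compare it against model.predict:
# Manually evaluate the fitted equation with the learned parameters
a = model.intercept_
b = model.coef_[0]
years = 5.0  # hypothetical value of YearsExperience
new_point = pd.DataFrame({"YearsExperience": [years]})

print(f"Manual prediction: {a + b * years}")
print(f"model.predict:     {model.predict(new_point)[0]}")
Both values should coincide. We can now use the trained model to predict the salaries of the test set: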
y_pred = model.predict(X_test)
y_pred
To compare the predicted values with the original ones, we can easily draw a comparative plot as follows:
fig, axis = plt.subplots(1, 2, figsize = (5, 3.5))
total_data = pd.concat([train_data, test_data])
# We use the parameters fitted during training (model.intercept_ and model.coef_[0]) to draw the regression line in the plots
regression_equation = lambda x: 26354.43069701219 + 9277.78307971 * x
sns.scatterplot(ax = axis[0], data = test_data, x = "YearsExperience", y = "Salary")
sns.lineplot(ax = axis[0], x = test_data["YearsExperience"], y = regression_equation(test_data["YearsExperience"]))
sns.scatterplot(ax = axis[1], x = test_data["YearsExperience"], y = y_pred)
sns.lineplot(ax = axis[1], x = test_data["YearsExperience"], y = regression_equation(test_data["YearsExperience"])).set(ylabel = None)
plt.tight_layout()
plt.show()
As we can see, the values predicted by the model always lie on the regression line, since that is precisely the equation the model has learned. The figure on the left represents the actual values, while the one on the right shows the predicted ones. We see that some predicted values match the actual values, while those that do not show a noticeable difference. Next, we will look at the metric values to learn more about the performance of the algorithm.
To calculate the effectiveness of the model, we will use the mean squared error (MSE) and the coefficient of determination ($R^2$), two of the most popular regression metrics:
from sklearn.metrics import mean_squared_error, r2_score
print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
The lower the MSE value, the better the model. A perfect model (a hypothetical model that always predicts the exact expected value) would have an MSE of 0. Here the MSE is around 37 million, which at first glance might suggest a very poor model; however, the MSE is expressed in squared units of the target, so it is hard to interpret on its own. If we rely on the $R^2$ value instead, we observe that it is about 0.95, a very high value: roughly 95% of the variance in the salary is explained by the model, so the result is satisfactory.
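To express the error in the same units as the target (salary instead of squared salary), we could also compute the root mean squared error (RMSE); a minimal sketch assuming NumPy:
import numpy as np

# RMSE: square root of the MSE, in the same units as the salary
print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, y_pred))}")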
This type of model cannot be optimized due to the absence of hyperparameters.
To exemplify the implementation of a multiple linear regression model, we will use a dataset with a few instances that has been previously treated with a full EDA.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_weight-height_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_weight-height_test.csv")
train_data.head()
For this problem, we want to calculate the weight (Weight) as a function of the height (Height) and gender (Gender) of the person. Therefore, Weight will be the dependent variable (target variable), and Height and Gender will be the independent variables (predictor variables). Since this is a continuous numerical prediction, we have to solve it with a multiple linear regression model.
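In other words, the model to be fitted has the following form, where $a$ is the intercept and $b_1$ and $b_2$ are the coefficients associated with the two predictor variables (in the same order as the columns of X_train defined below):

$$\text{Weight} = a + b_1 \cdot x_1 + b_2 \cdot x_2$$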
As the exploratory analysis process is not shown in this notebook, the relationship between each predictor variable and the target variable (normally examined in the bivariate analysis) will be visualized below using scatter plots with fitted regression lines and correlation heatmaps:
fig, axis = plt.subplots(2, 2, figsize = (10, 7))
total_data = pd.concat([train_data, test_data])
sns.regplot(ax = axis[0, 0], data = total_data, x = "Gender", y = "Weight")
sns.heatmap(total_data[["Weight", "Gender"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 0], cbar = False)
sns.regplot(ax = axis[0, 1], data = total_data, x = "Height", y = "Weight")
sns.heatmap(total_data[["Weight", "Height"]].corr(), annot = True, fmt = ".2f", ax = axis[1, 1], cbar = False)
plt.tight_layout()
plt.show()
There is a clear linear relationship between the predictor variables and the target variable, so the problem can be modeled well by this type of model. If the correlations were lower, the model would not achieve good predictive performance.
X_train = train_data.drop(["Weight"], axis = 1)
y_train = train_data["Weight"]
X_test = test_data.drop(["Weight"], axis = 1)
y_test = test_data["Weight"]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
After the training process, we can inspect the parameters (variables $a$, $b_1$ and $b_2$) that the model has fitted:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients (b1, b2): {model.coef_}")
y_pred = model.predict(X_test)
y_pred
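Since there are now two predictor variables, the fit can no longer be drawn as a single regression line over one feature. As a simple alternative (a minimal sketch, not part of the original notebook), we can plot the predicted values against the actual ones; the closer the points are to the diagonal, the better the predictions:
fig, axis = plt.subplots(figsize = (5, 5))

# Predicted vs. actual weights, with the diagonal (y = x) as a reference of a perfect fit
sns.scatterplot(ax = axis, x = y_test, y = y_pred)
sns.lineplot(ax = axis, x = y_test, y = y_test, color = "red")
axis.set(xlabel = "Actual weight", ylabel = "Predicted weight")

plt.tight_layout()
plt.show()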
from sklearn.metrics import mean_squared_error, r2_score
print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
If we rely on the $R^2$ value, we observe that it is about 0.90, a very high value: roughly 90% of the variance in the weight is explained by the model, so the result is satisfactory.
This type of model cannot be optimized due to the absence of hyperparameters.