What is a Linear Regression and how to use it in machine learning models

Linear regression is a type of model used to predict the value of a dependent variable (or target variable) based on the value of one or more independent variables (or predictor variables). It assumes that there is a linear relationship between the independent variables and the dependent variable. If the target variable is modeled with a single predictor variable, the regression is said to be simple. If it has several predictors, it is called multiple regression.

This model is based on five assumptions, which are as follows:

  1. Linearity: The target variable and the predictor(s) have a linear relationship.
  2. Independence: The observations are independent of each other.
  3. Homoscedasticity: The variance of the errors (i.e., the differences between the model predictions and the actual data) is constant at all levels of the independent variables.
  4. Normality of errors: Errors are normally distributed. This is important for performing statistical tests and constructing confidence intervals.
  5. Absence of multicollinearity: In a multiple regression, the independent variables are not perfectly correlated with each other. If one predictor is an exact linear function of the others (for example, the same quantity recorded twice in different units), the data is said to have perfect multicollinearity, and the coefficients cannot be determined uniquely. Strong but imperfect correlation still makes the coefficient estimates unstable.
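
The multicollinearity assumption can be checked informally by looking at the correlation between pairs of predictors. The following is a minimal sketch in plain Python, not a full diagnostic. The helper `pearson` and the housing-style data (`size_m2`, `size_ft2`, `rooms`) are hypothetical and exist only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictors: the same house size in two units, plus room count.
size_m2 = [50.0, 65.0, 80.0, 95.0, 120.0]
size_ft2 = [s * 10.764 for s in size_m2]  # exact linear function of size_m2
rooms = [2.0, 3.0, 3.0, 4.0, 4.0]

r_redundant = pearson(size_m2, size_ft2)  # redundant pair of predictors
r_rooms = pearson(size_m2, rooms)         # related, but not identical

print("size_m2 vs size_ft2:", round(r_redundant, 6))
print("size_m2 vs rooms:   ", round(r_rooms, 6))
```

A correlation of essentially 1 between `size_m2` and `size_ft2` signals that one of the two columns is redundant and should be dropped before fitting. The correlation with `rooms` is high but below 1, so the coefficients can still be computed, although they may be less stable.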

What is a Simple Linear Regression

Simple linear regression allows us to study the relationship between two continuous numerical variables. In this type of regression, we try to fit the straight line that best describes the relationship between the two variables. That is, we look for the line that minimizes the vertical distances between itself and the data points (in ordinary least squares, the sum of their squares). The ideal case is when all the points lie exactly on the line and there is no dispersion.

The equation that defines this relationship (straight line) is:

$$Y = a + bX + e$$

Where:

  • $Y$ is the dependent variable we are trying to predict or model.
  • $X$ is the independent variable that we use to make the prediction.
  • $a$ and $b$ are the coefficients we want the model to learn. $a$ is the intercept (value of $Y$ when $X$ is zero) and $b$ is the slope.
  • $e$ is the prediction error discussed above. It is the difference between the actual value of $Y$ and the value of $Y$ predicted by the model.

The goal of simple linear regression is, therefore, to find the values of $a$ and $b$ that minimize the errors $e$ over all observations. Once we have found these values, we can use them to predict $Y$ for any given $X$.

In simple linear regression, each value of the dependent variable is paired with the value of a single independent variable.
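
As an illustration of how the intercept and slope can be found, here is a minimal sketch that uses the standard closed-form least squares solution: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. It assumes the error being minimized is the sum of squared errors. The function name `fit_simple_linear_regression` and the study-hours data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for the line Y = a + b*X that minimizes squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

a, b = fit_simple_linear_regression(hours, scores)

# The error e for each observation: actual value minus predicted value.
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

# Use the learned coefficients to predict Y for a new X.
predicted = a + b * 6.0

print("intercept a:", round(a, 3))
print("slope b:", round(b, 3))
print("prediction for 6 hours:", round(predicted, 3))
```

For this toy data the fit gives an intercept of about 47.7 and a slope of about 4.1, so each additional hour is associated with roughly 4.1 more points. In practice a library such as scikit-learn would usually do this work; the manual version is shown only to make the calculation visible.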

What is a Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression used when there is more than one independent variable. It models the relationship between two or more features and a response by fitting a linear equation (with more terms than the one above) to the observed data.

The basic form of a multiple linear regression equation with n variables is:

$$Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n + e$$

Where:

  • $Y$ is the dependent variable that we are trying to predict or model.
  • $X_1, X_2, \dots, X_n$ are the independent variables that we use to make the prediction.
  • $a$ and $b_1, b_2, \dots, b_n$ are the coefficients we want the model to learn.
  • $e$ is the prediction error discussed above. It is the difference between the actual value of $Y$ and the value of $Y$ predicted by the model.

Multiple linear regression allows the analyst to determine which variables have a significant impact on the dependent variable and by what magnitude. It relies on the same assumptions as simple linear regression (linearity, independence, homoscedasticity, and normality of errors) plus the absence of multicollinearity among the predictors, and the results may be unreliable if these assumptions are violated.
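
To make the multiple regression equation concrete, the sketch below estimates the coefficients by solving the least squares normal equations with Gaussian elimination in plain Python. This is one possible approach chosen to keep the example dependency-free, not the only way to fit the model. The helper names `solve_linear_system` and `fit_multiple_linear_regression` are hypothetical, and the data is generated without noise from known coefficients so the result is easy to verify.

```python
def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are not modified.
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_multiple_linear_regression(X, y):
    """Return [a, b_1, ..., b_n] minimizing the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 models the intercept a
    k = len(rows[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from Y = 1 + 2*X1 + 3*X2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]

coeffs = fit_multiple_linear_regression(X, y)
a, b1, b2 = coeffs
print("a, b1, b2:", [round(c, 6) for c in coeffs])
```

Because the data contains no noise, the recovered coefficients should match the generating values of 1, 2, and 3 up to floating-point rounding. With real data the errors are not zero and the coefficients are estimates. If two predictors were perfectly collinear, the normal equations would have no unique solution, which is the practical reason behind the multicollinearity assumption.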