Linear Regression is a supervised learning algorithm that assumes a linear relationship between the input variables (x) and the continuous output variable (y). In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. When there is one input (explanatory or independent) variable, the process is called simple linear regression; for more than one, it is called multiple linear regression. Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is called Ordinary Least Squares (OLS).
Table of Contents
- Overview
- Types of Linear Regression
- Assumptions
- Video
- Evaluation (To be added)
- Overfitting in Regression (To be added)
- Code (To be added)
Overview
The objective of Linear Regression is to predict a dependent variable value (y) from given independent variables (x) using a best fit line. We assume there is a linear relationship between the dependent and independent variables, and the dependent variable must be continuous in nature.
Let’s understand the theory of linear regression with an example:
We want to predict house prices (y) using a single independent feature, “size in square feet” (x).
| size (sq. ft.) (x) | price ($) (y) |
|---|---|
| 915 | 195,000 |
| 1550 | 290,000 |
| 2350 | 395,000 |
| … | … |
Linear regression can help us predict house prices from the above data set. Suppose the pink line in the image above is the best fit line, with a corresponding mathematical equation. If we know the equation of that line, then for any given house size, i.e. input (x), we can predict the house price, i.e. output (y).
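To make this concrete, here is a minimal sketch, assuming Python with NumPy, that fits an ordinary least squares line to the three example rows above and then predicts the price for a hypothetical 1,800 sq. ft. house (the 1,800 sq. ft. query and the variable names are illustrative, not from the original data):

```python
import numpy as np

# Toy data from the table above: size in sq. ft. (x) and price in $ (y)
x = np.array([915, 1550, 2350], dtype=float)
y = np.array([195_000, 290_000, 395_000], dtype=float)

# Fit a degree-1 polynomial (a straight line) by ordinary least squares.
# np.polyfit returns the coefficients highest degree first: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
print(f"best fit line: price ≈ {intercept:.2f} + {slope:.2f} * size")

# Use the fitted line to predict the price of a hypothetical 1800 sq. ft. house
size_new = 1800.0
predicted_price = intercept + slope * size_new
print(f"predicted price for {size_new:.0f} sq. ft.: ${predicted_price:,.0f}")
```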
Hypothesis Function
Let’s use the hypothesis function:

hθ(x) = θ0 + θ1x
Cost Function

The vertical distances between the data points and the fitted line (the best fit line) are called errors, or residuals. The objective is to fit the regression line by minimizing the sum of the squares of these errors, known as the Cost Function; this is also known as the principle of least squares. A common form of the cost function is

J(θ0,θ1) = (1/2m) Σ (hθ(x(i)) − y(i))²

where the sum runs over the m training examples. The goal is to

minimize J(θ0,θ1) over θ0 and θ1

where J(θ0,θ1) is the cost function defined above.

Optimization – Gradient Descent

Gradient descent minimizes the cost function by repeatedly updating θ0 and θ1 in the direction of the negative gradient of J(θ0,θ1) until the cost converges.
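To illustrate how this could be implemented, here is a minimal sketch in Python with NumPy (the function names, learning rate, iteration count, and the feature-scaling step are illustrative choices, not part of the original post), showing the cost function J(θ0,θ1) and a batch gradient descent loop for the single-feature case:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) using the 1/(2m) convention."""
    m = len(x)
    predictions = theta0 + theta1 * x          # hθ(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    """Batch gradient descent on J(theta0, theta1) for simple linear regression."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y     # hθ(x) − y
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * x) / m
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

# Example usage on the toy house data; the features are standardized first,
# since gradient descent converges poorly on raw square-footage values.
x = np.array([915.0, 1550.0, 2350.0])
y = np.array([195_000.0, 290_000.0, 395_000.0])
x_scaled = (x - x.mean()) / x.std()
y_scaled = (y - y.mean()) / y.std()
theta0, theta1 = gradient_descent(x_scaled, y_scaled, lr=0.1, n_iters=2000)
print("theta0, theta1 (scaled units):", theta0, theta1)
print("final cost:", cost(theta0, theta1, x_scaled, y_scaled))
```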
Visit this link for regression notes: Univariate Linear Regression
Assumptions of Linear Regression
| Sr. No. | Assumption | Explanation | Test |
|---|---|---|---|
| 1 | Linearity | The relationship between the dependent and independent variables should be linear. | 1. Visualization – scatter plot (between dependent and independent variables) 2. Correlation with the dependent variable |
| 2 | Mean of residuals should be zero | Residuals are the differences between the true values and the predicted values. | np.mean(y_train - y_pred) ≈ 0 |
| 3 | Homoscedasticity | Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. Plotting the error terms against the predicted values should show no pattern; there should be no heteroscedasticity. | 1. Visualization – residuals vs. fitted values plot 2. Goldfeld–Quandt test 3. Bartlett’s test |
| 4 | Normality of residuals | The residuals should be normally distributed. | Visualization – histogram of residuals |
| 5 | No autocorrelation of residuals | When the residuals are autocorrelated, the current value depends on the previous (historic) values, and there is a definite unexplained pattern in the Y variable that shows up in the error terms. This is more common in time-series data. | 1. Visualization – residuals vs. fitted values line plot 2. Ljung–Box test for autocorrelation 3. Autocorrelation plot (ACF) 4. Partial autocorrelation plot (PACF) |
| 6 | No perfect multicollinearity | Multicollinearity is the presence of high correlations among two or more independent variables. | Variance Inflation Factor (VIF) |
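As a rough sketch of how these checks could be run in Python (assuming statsmodels, SciPy, pandas, and NumPy are available; the synthetic data and variable names below are illustrative only, not from the original post):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox, het_goldfeldquandt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

# Illustrative synthetic data with two independent variables
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
y = 3.0 + 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(scale=0.5, size=200)

# Fit an OLS model with an intercept
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
residuals = model.resid

# 2. Mean of residuals should be (approximately) zero
print("mean of residuals:", residuals.mean())

# 3. Homoscedasticity – Goldfeld–Quandt test (null hypothesis: equal variances)
f_stat, p_value, _ = het_goldfeldquandt(y, X_const)
print("Goldfeld–Quandt p-value:", p_value)

# 4. Normality of residuals – Shapiro–Wilk test (null hypothesis: normality)
stat, p_norm = shapiro(residuals)
print("Shapiro–Wilk p-value:", p_norm)

# 5. No autocorrelation – Durbin–Watson (≈ 2 means none) and Ljung–Box test
print("Durbin–Watson statistic:", durbin_watson(residuals))
print(acorr_ljungbox(residuals, lags=[10]))

# 6. No perfect multicollinearity – Variance Inflation Factor per column
for i, col in enumerate(X_const.columns):
    print(col, "VIF:", variance_inflation_factor(X_const.values, i))
```

These tests are only diagnostics; in practice the p-values would be read alongside the corresponding residual plots before deciding whether an assumption holds.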