Linear Regression

Linear regression predicts the value of a dependent variable (the response variable, called $y$) from a given independent variable (the predictor variable, called $x$). In other words, this regression technique finds a linear relationship (correlation) between $x$ (input) and $y$ (output).

Training Linear Regression:
The hypothesis function for linear regression is:

  $y = w x + b$

  • $x$: independent variable, input training data
  • $y$: dependent variable, predicted value
  • $w$: weight (coefficient of $x$), the slope of the line in the graph
  • $b$: bias (intercept)
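
As a minimal sketch, the hypothesis can be written as a one-line function. It assumes the usual form y = w * x + b, with w as the weight (slope) and b as the bias (intercept). The symbol names and the parameter values below are illustrative assumptions, not values taken from these notes.

```python
def predict(x, w, b):
    """Hypothesis of simple linear regression: y = w * x + b."""
    return w * x + b


# Hypothetical parameters, chosen only for illustration.
w, b = 2.0, 1.0

# Predicted values for three inputs.
predictions = [predict(x, w, b) for x in [0.0, 1.0, 2.0]]
print(predictions)
```

Training, described next, is the process of choosing w and b from data instead of picking them by hand.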

Training means finding the best-fit line to predict the value of $y$ for a given value of $x$, that is, finding the values of the weight $w$ and the bias $b$. With the best-fit regression line, the model predicts $y$ such that the difference between the predicted value and the true value is as small as possible. To get there, we need the values of $w$ and $b$ that minimize the error. This is done using a cost function, namely the “sum of squared errors”.
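
The cost described above can be written as $J(w, b) = \sum_i (\hat{y}_i - y_i)^2$, where $\hat{y}_i = w x_i + b$ is the prediction for sample $i$. The sketch below computes that cost and finds the minimizing w and b with the closed-form least-squares solution. The notes do not say which optimizer is intended (gradient descent is another common choice), so the closed form is an assumption made here for brevity. The data and the function names `sum_squared_error` and `fit_line` are hypothetical.

```python
def sum_squared_error(xs, ys, w, b):
    """Cost function: sum over samples of (prediction - true value)^2."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares estimates of slope w and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

w, b = fit_line(xs, ys)
best_cost = sum_squared_error(xs, ys, w, b)

# Any other line, for example w=2, b=0, has a cost at least as large.
other_cost = sum_squared_error(xs, ys, 2.0, 0.0)
print(w, b, best_cost, other_cost)
```

The comparison at the end illustrates what "best fit" means here: no other choice of w and b gives a smaller sum of squared errors on the training data.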


Concepts:

  • Intercept: Point where the regression line crosses the y axis.
  • Slope: The inclination of the regression line.
  • Extrapolation: using the estimated regression equation to estimate a mean ($\mu_Y$) or to predict a new response ($y_{new}$) for x values outside the range of the sample data.
    • Extrapolation beyond the scope of the model (the range of the sample data) is considered dangerous because the estimated regression equation often doesn’t provide accurate or even meaningful output outside that range.
  • Multicollinearity: when two or more independent variables are highly correlated with each other, so they carry largely the same information.
    • Multicollinear variables are redundant: they inflate the variance of the coefficient estimates and make the model less reliable.
    • Correlation tests are used to identify multicollinearity, and variables of this type are removed from the model.
  • Residuals: The residual (also called the error term, $e$) is the difference between the observed value ($y$) and the predicted value ($\hat{y}$), and is used to check the assumptions of the regression. It’s calculated as $e = y - \hat{y}$.
  • Residuals should be approximately normally distributed, and features should be correlated with the target but not strongly correlated with each other.
  • Multiple Linear Regression: It is a regression model with two or more independent variables and one dependent variable.
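
Two of the concepts above, residuals and the correlation check for multicollinearity, can be sketched in a few lines. The data, the fitted parameters, the helper name `pearson`, and the 0.9 cut-off are all hypothetical choices for illustration. The residual is computed as observed minus predicted, which is the common convention.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Residuals: observed value minus predicted value, one per sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
w, b = 1.97, 0.09  # hypothetical least-squares fit for this data
residuals = [y - (w * x + b) for x, y in zip(xs, ys)]

# Multicollinearity check: correlation between pairs of features.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]  # nearly 2 * x1, so largely redundant
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]  # only weakly related to x1

THRESHOLD = 0.9  # hypothetical cut-off for "highly correlated"
r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
redundant_pairs = [
    name
    for name, r in [("x1-x2", r12), ("x1-x3", r13)]
    if abs(r) > THRESHOLD
]
print(residuals, r12, r13, redundant_pairs)
```

For a least-squares line with an intercept, the residuals sum to roughly zero, which is a quick sanity check on a fit. A pair flagged in `redundant_pairs` is a candidate for dropping one of its two variables.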

Notes:


Assumptions in linear regression:


Advantages:

  • It performs very well when the relationship between the variables is linear.
  • It's easy to understand and visualize.
  • It's very fast to train and to predict with.

Disadvantages:

  • The data can have complex relationships that are not easy to capture in a linear model.
  • It's prone to overfitting when there are many features relative to the number of training samples.
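
The first disadvantage can be illustrated with a small, made-up example: data that follows y = x^2 exactly, on x values symmetric around zero. The least-squares line through it (closed-form fit, used here as an assumption) comes out flat, so it does no better than always predicting the mean of y. All names and values below are hypothetical.

```python
# Hypothetical data with a purely quadratic relationship.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
w = sxy / sxx
b = mean_y - w * mean_x

# Sum of squared errors of the fitted line ...
line_error = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))
# ... compared with always predicting the mean of y.
baseline_error = sum((mean_y - y) ** 2 for y in ys)

print(w, b, line_error, baseline_error)
```

Here y depends entirely on x, yet the fitted slope is zero and the line's error equals the baseline error: a straight line cannot capture the relationship.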