Assumptions of Linear Regression
When performing a linear regression, there are a few assumptions that need to hold in order to get trustworthy results:
- Linear relationship between the target variable and predictor variable.
- Errors are independent.
- Errors are homoskedastic, meaning they all have the same variance.
- Errors are normally distributed.
In this short post we will go over what linear relationships are and how we can improve relationships between our variables to improve our linear regressions.
For linear regression to be effective, we must assume that we have a linear relationship. A linear relationship between two numerical variables is one where a change in one variable is matched by a constant change in the other. For example, if variables X and Y are in a linear relationship and we increase the value of X by 1, we expect the value of Y to increase by a certain amount. If we increase X by 1 again, we expect Y to increase by that same, constant amount.
We can use this property for linear regression predictions: if we know a predictor variable that is in a linear relationship with the target variable, we can predict the target from the value of the predictor. In a positive linear relationship, if the predictor goes up by one unit, the target goes up by a constant amount; if the predictor goes down by one unit, the target goes down by that same amount. So to predict the target at a given value of the predictor, we simply follow the line until we reach that value.
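The constant-step property is easy to check numerically. The following is a minimal sketch, assuming a made-up line with an intercept of 2 and a slope of 3 (both values are illustrative and not taken from any dataset): each one-unit step in x changes y by the same amount.

```python
# Hypothetical linear relationship: y = 2 + 3 * x (values chosen for illustration).
INTERCEPT = 2.0
SLOPE = 3.0


def linear(x):
    """Return y for a given x under the assumed linear relationship."""
    return INTERCEPT + SLOPE * x


xs = [0, 1, 2, 3, 4]
ys = [linear(x) for x in xs]

# Change in y for each one-unit increase in x.
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(ys)     # [2.0, 5.0, 8.0, 11.0, 14.0]
print(steps)  # [3.0, 3.0, 3.0, 3.0]
```

Every entry in `steps` equals the slope, which is what makes prediction along a straight line so simple.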
The best way to check if two numerical variables are in a linear relationship is to plot them against each other, typically as a scatter plot. If the points fall on a perfectly straight diagonal line, then we can be certain that it is a linear relationship.
The line can also slope downward and still represent a linear relationship. This is known as a negative linear relationship, where if one variable increases, the other decreases at a constant rate.
If the points trace out a curve rather than a straight line, this is a non-linear relationship. The two variables are still related (for example, one increases as the other does), but the relationship is not linear because the rate of change is not constant.
If the points show no resemblance to a line or curve at all, then we can say that there is no apparent relationship between the two variables.
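A numeric summary can complement the plot. The sketch below is an addition to the post's plotting approach, not a replacement for it: it computes the Pearson correlation coefficient for four made-up datasets matching the cases described above. The helper `pearson_r` and all the data are hypothetical illustrations.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(11))
positive = [2 * x + 1 for x in xs]   # straight line sloping up
negative = [-2 * x + 1 for x in xs]  # straight line sloping down
curved = [x ** 2 for x in xs]        # related, but not linearly

# Two independent random variables stand in for "no relationship".
rng = random.Random(0)
noise_x = [rng.random() for _ in range(2000)]
noise_y = [rng.random() for _ in range(2000)]

r_positive = pearson_r(xs, positive)
r_negative = pearson_r(xs, negative)
r_curved = pearson_r(xs, curved)
r_noise = pearson_r(noise_x, noise_y)

print(f"positive linear: {r_positive:+.3f}")  # +1.000
print(f"negative linear: {r_negative:+.3f}")  # -1.000
print(f"curved:          {r_curved:+.3f}")    # about +0.963
print(f"unrelated:       {r_noise:+.3f}")     # close to zero
```

Note that the curved data still scores fairly high, so a large coefficient alone does not prove a relationship is linear. That is why looking at the plot remains the more reliable check.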
While our desired goal is a linear relationship, such a perfect specimen is hard to find in the wild in the majority of datasets. Variables that have no relationship are of little use to us, so we can safely disregard them most of the time. Thus we can put our main focus on the non-linear relationships. While we can still use a non-linear relationship for linear regression, its accuracy decreases the further the plotted curve deviates from a straight line. In our non-linear example, as the x-value increases, the y-value increases at an ever faster rate; thus the higher our x-value, the further a straight-line prediction can stray from the true y-value.
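To see how a straight line struggles with a curved relationship, here is a small sketch that fits an ordinary least-squares line to made-up data following y = x². The data and the helper `fit_line` are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical curved data: y grows faster and faster as x increases.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)

# Residual = actual y minus the straight-line prediction.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(intercept, slope)  # -15.0 10.0
print(residuals)
# [15.0, 6.0, -1.0, -6.0, -9.0, -10.0, -9.0, -6.0, -1.0, 6.0, 15.0]
```

The misses are not random: the line underestimates at both ends and overestimates in the middle, and the gap keeps widening for x-values beyond the fitted range.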
To combat this, we can transform our inputs to bend the curved, non-linear pattern into a straighter form. The straighter we can get it, the better the accuracy of our linear regression will be. The way to do this is to apply a non-linear transformation to one or more of our variables. Linear transformations, by contrast, change the values but preserve the shape of the relationship between variables. Examples are arithmetic operations with a constant, such as adding, subtracting, multiplying, or dividing. While these can shift, stretch, or flip the plot, a curve stays a curve, so they do not help our case at all.
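One rough way to confirm that a linear transformation leaves the curve intact is to look at second differences, which are zero for a straight line when the x-values are evenly spaced. This is only a sketch under that assumption; the helper `second_differences` and the constants 3 and 7 are made up for illustration.

```python
def second_differences(values):
    """Differences of consecutive differences (assumes evenly spaced x)."""
    first = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(first, first[1:])]


xs = list(range(6))
curved = [x ** 2 for x in xs]            # the non-linear relationship
rescaled = [3 * y + 7 for y in curved]   # linear transformation of y
straight = [2 * x + 1 for x in xs]       # a genuinely linear relationship

print(second_differences(curved))    # [2, 2, 2, 2]
print(second_differences(rescaled))  # [6, 6, 6, 6]
print(second_differences(straight))  # [0, 0, 0, 0]
```

Multiplying and adding changed the numbers, but the second differences of the rescaled data are still non-zero, so the curve has not been straightened.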
Non-linear transformations, on the other hand, change both the values and the shape of the relationship. Typical choices are to transform one or more of the variables by taking the log, squaring, taking the square root, or taking the reciprocal. Returning to our earlier example, we can apply the non-linear transformation of taking the square root of our y-values.
The result turns our parabolic curve into a perfectly straight line!
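The effect of the square root can be measured by comparing how well a straight line fits before and after the transformation. The sketch below uses R² (the share of variance explained by the fitted line) on made-up data following y = x²; the helper `r_squared` and the data are illustrative assumptions.

```python
import math


def r_squared(xs, ys):
    """R-squared of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical parabolic data.
xs = list(range(11))
ys = [x ** 2 for x in xs]

# Non-linear transformation: square root of every y-value.
sqrt_ys = [math.sqrt(y) for y in ys]

r2_raw = r_squared(xs, ys)
r2_sqrt = r_squared(xs, sqrt_ys)

print(f"raw y:     {r2_raw:.3f}")   # 0.928
print(f"sqrt of y: {r2_sqrt:.3f}")  # 1.000
```

The square root only works on non-negative values, and it straightens this particular curve because the data were built as a square. Other curve shapes may call for a log or reciprocal instead.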
As long as we account for the square root transformation in all further calculations, this variable can be put to good use in our linear regression, as it is now in an ideal linear relationship. While most real-world examples will not have such a simple solution, this is a quick showcase of the mindset needed to improve the linear relationships between variables. This will in turn raise the accuracy of a linear regression and improve your predictions!
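Accounting for the transformation mostly means undoing it when making predictions. Here is a minimal sketch, assuming made-up data that follow y = (2x + 1)²: the line is fitted to the square root of y, and each prediction is squared to return to the original units. The helpers `fit_line` and `predict_y` are hypothetical names for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data in original units.
xs = list(range(11))
ys = [(2 * x + 1) ** 2 for x in xs]

# Fit the regression on the transformed target, sqrt(y).
intercept, slope = fit_line(xs, [math.sqrt(y) for y in ys])


def predict_y(x):
    """Predict on the sqrt scale, then square to undo the transformation."""
    return (intercept + slope * x) ** 2


print(intercept, slope)  # 1.0 2.0
print(predict_y(12))     # 625.0
```

Forgetting the final squaring step would report 25 instead of 625 here, which is the kind of mistake that keeping track of the transformation avoids.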