Simple Linear Regression

Simple is best.

Jul 21, 2024

In this week’s article, we learn how to fit closed-form regression models.

Preamble

We should learn about linear models. If we can fit models to financial data, we are entering the territory of “quantitative analysis.” No longer will we draw trendlines on charts by hand; we’ll let math do that for us!

Regression Therapy

Perhaps you learned this simple formula for a linear model back in high school:

\(\hat y=mx+b\)

The output of our function is the variable “y hat,” or the model prediction. To get our output, two transformations are applied to each input (x): we multiply it by the slope (m), then add the y-intercept (b). Our task is to find values for m and b such that our x values are transformed into predictions that most closely resemble the original y values.
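As a minimal sketch of the formula above, here is the prediction step in Python. The function name `predict` and the values chosen for m and b are hypothetical, picked only for illustration; they are not fitted to any data.

```python
def predict(x, m, b):
    """Return y hat for a single input x, given slope m and intercept b."""
    return m * x + b


# Hypothetical parameters, chosen only to show the transformation.
m, b = 2.0, 1.0
xs = [0.0, 1.0, 2.0, 3.0]

# Multiply each input by the slope, then add the intercept.
y_hat = [predict(x, m, b) for x in xs]
print(y_hat)  # [1.0, 3.0, 5.0, 7.0]
```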

The big question is, “How do we find values for m and b?”

We can begin with the slope.

\(m = {\text{COV}(x, y) \over \sigma_x^2}\)

If we can calculate the covariance between x and y, then divide it by the variance of the x values (the standard deviation of x, squared), we can determine our slope.

What about the intercept?

\(b = \bar y - m \bar x\)

There are many ways to calculate this, but I prefer to use the mean of the y values, or “y bar,” and the mean of the x values as a point on the line to determine the intercept, since the fitted line always passes through the point of means. Notice this is essentially a rearrangement of the first formula!
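The slope and intercept steps can be sketched together in a few lines of Python. This sketch assumes the summary statistics are already known: the function name `slope_and_intercept` is hypothetical, and the numbers passed in were worked out by hand for a small made-up dataset, x = [1, 2, 3, 4] and y = [2, 4, 5, 7]. The slope divides the covariance by the variance of x, which is the standard deviation of x squared.

```python
def slope_and_intercept(cov_xy, var_x, mean_x, mean_y):
    """Return (m, b) from precomputed summary statistics.

    var_x is the variance of x, i.e. the standard deviation of x squared.
    """
    m = cov_xy / var_x          # slope
    b = mean_y - m * mean_x     # intercept: rearranged y = mx + b at the means
    return m, b


# Hand-computed statistics for the made-up data x = [1, 2, 3, 4], y = [2, 4, 5, 7].
m, b = slope_and_intercept(cov_xy=2.0, var_x=1.25, mean_x=2.5, mean_y=4.5)
print(m, b)  # 1.6 0.5

# The fitted line passes through the point of means (x bar, y bar).
print(abs((m * 2.5 + b) - 4.5) < 1e-12)  # True
```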

We’re doing great, but we need to know how to calculate covariance and standard deviation. We will tackle covariance first.

\(\text{COV}(x, y) = {\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y) \over n}\)

Looks ugly, but don’t worry!

The new n variable simply refers to the length of our dataset. Both “x sub i” and “y sub i” refer to a single point of data within our set. When we subtract the mean from a data point, we have a deviation.

Essentially, for each pair of x and y points in our dataset, we take their respective deviations, and multiply the deviations together. Once we do that for the whole set, we sum those products and divide by the length.
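The steps just described translate almost word for word into Python. This is a sketch rather than a definitive implementation: the helper names `mean` and `covariance` and the small dataset are made up for illustration. It divides by n, as in the formula above.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance of paired data, dividing by n as in the formula above."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    # For each (x, y) pair, multiply the two deviations together.
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    # Sum the products and divide by the length of the dataset.
    return sum(products) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 7.0]
print(covariance(xs, ys))  # 2.0
```

Note that Python's built-in `statistics.covariance` divides by n - 1 rather than n, so it would return a slightly larger number for the same data.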

Now, we still need to know how to calculate the standard deviation, and we might as well refresh what a mean is. We can turn to the standard deviation now, written here for y:

\(\sigma_y = \sqrt{\sum_{i=1}^{n} (y_i - \bar y)^2 \over n}\)

The standard deviation of y is the square root of the mean of the squared deviations of y. The squaring matters: the raw deviations from the mean always sum to zero. That leaves only the mean to define!
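Here is a minimal sketch of the standard deviation in Python, taken as the square root of the average squared deviation and dividing by n. The function name `std_dev` and the data are hypothetical. The first print shows why the deviations have to be squared: left as they are, they cancel out.

```python
import math


def std_dev(values):
    """Standard deviation, dividing by n (the population form)."""
    n = len(values)
    mean = sum(values) / n
    # Square each deviation so positives and negatives do not cancel.
    squared = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared) / n)


ys = [2.0, 4.0, 5.0, 7.0]
y_bar = sum(ys) / len(ys)

# Raw deviations from the mean sum to zero.
print(sum(y - y_bar for y in ys))  # 0.0

# Squaring first gives a useful measure of spread, roughly 1.80 here.
print(std_dev(ys))
```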

\(\bar y = {\sum_{i=1}^n y_i \over n}\)

The mean is the easiest to understand. We take the sum of all the points in the y dataset and divide by the length of the set. For the mean and standard deviation, we calculate the respective x statistics in exactly the same way but with the x data.

If you made it this far, you now know how to calculate a linear regression!
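To put every piece together, here is a sketch of the whole fit in plain Python. The function name `fit_line` and the price series are made up for illustration, and the slope uses the covariance divided by the variance of x (the standard deviation of x, squared), with both statistics divided by n.

```python
def fit_line(xs, ys):
    """Return (m, b) for the line y hat = m*x + b fitted to paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # Means of x and y.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Covariance of x and y, and variance of x, both dividing by n.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    if var_x == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    m = cov_xy / var_x       # slope
    b = y_bar - m * x_bar    # intercept
    return m, b


# Hypothetical closing prices over five trading days.
days = [1, 2, 3, 4, 5]
prices = [101.0, 103.0, 102.0, 105.0, 107.0]

m, b = fit_line(days, prices)
print(f"slope={m:.2f} intercept={b:.2f}")  # slope=1.40 intercept=99.40
```

In this made-up example the trendline rises by about 1.40 per day, which is the kind of line we would otherwise have drawn on the chart by hand.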
