In this week’s article, we learn how to fit nonlinear models.
Is Simplicity Best?
Last week, we calculated the closed-form equation for a simple linear model, but how do we fit a regression model if our system is more complicated than a line?
To begin, we can look at the output of a simple linear regression model where the independent variable is time and the dependent variable is the log-transformed price of the S&P 500 cash index (SPX). Note that by framing it this way, we are claiming that price depends on time, and not the other way around.
Even a single line can explain most of the variation seen in the price of the SPX.
Take a peek at the equation in the top-left corner of the plot; this is the output of our model. Using this equation, you can plug in any year and get an approximate log price for the S&P 500.
If you look at the y-axis, it’s apparent that this is not a linear scale. We are viewing the data on a log-scale, and this is the same scale that our model was fit on. Because compounding returns grow exponentially, we have to take the logarithm of price for its relationship to time to appear linear.
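To see why, consider a hypothetical asset that compounds at a constant rate. A minimal sketch (the 7% rate is made up for illustration):

t <- 0:50
price <- 100 * 1.07 ^ t   # 7% compounded annually

# price curves upward exponentially, but its logarithm is a straight
# line in t: log(price) = log(100) + t * log(1.07)
plot(t, log(price), type = "l")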
Goodness of Fit
While you might have intuitively noticed that our model fits the data well, we have a couple of metrics that quantify just how well it fits. These are called “goodness-of-fit” measures. Both are calculated on the log scale, before the model output is transformed back to price units.
First, we will look at the coefficient of determination (also known as R², pronounced “R squared”). This is the squared correlation between our data and the model prediction. For the above model, we have an R² value of 96.22%, meaning that our model explains 96.22% of the variation in the log price of the SPX.
Our next measure is the root mean squared deviation (RMSD). This is calculated by taking the difference between paired data points, squaring those deviations, averaging the squared values, and finally taking the square root of that mean. This value is 0.3666 for the above model.
Think of the RMSD as a measure that we want to minimize. The smaller it is, the better our model fits the data. This is the main principle behind using an optimizer to fit our models, which we will discuss in the next section.
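In R, both measures are one-liners. Here's a sketch, assuming y holds the observed closing prices (as in the code below) and y_hat holds the model's predicted log prices:

# squared correlation between the log prices and the model's prediction
r_squared <- cor(log(y), y_hat) ^ 2

# root of the mean squared deviation between the two
rmsd <- sqrt(mean((log(y) - y_hat) ^ 2))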
Optimization
Mathematical optimization attempts to either minimize or maximize the output of a given function. This is an oversimplification, but it will suffice.
We can use an analogy to understand what an optimizer is trying to do.
Imagine you are standing out in a field. Your job is to try to get as high up as possible (maximize your elevation). You see a hill nearby and start walking up it. It seems to be working; you are steadily gaining elevation!
Once you reach the top of the hill, there’s nothing nearby left to climb. Even if you notice a bigger hill out in the distance, reaching it would require you to decrease your elevation before it could rise again.
Essentially, you have done what most simple optimizers do: you increased your elevation as best you could, but you may have ended up on a local maximum, a smaller hill than the tallest one available.
Translating the analogy over to an actual algorithm: our optimizer will be fitting a general form of a model by which it will try to minimize the sum of squared deviations between the model output and our original data.
Here’s what that looks like in the R language:
spx <- yahoo("^SPX") %>%
  mutate(time = year(date) + yday(date) / (365 + leap_year(date)))

x <- spx$time
y <- spx$close

prm <- list(m = 0,
            b = 0)

opt <- optim(prm, function(prm) {
  prm <- as.list(prm)
  # fitting the model
  y_hat <- with(prm, m * x + b)
  # minimizing the sum of squared deviations
  sum((log(y) - y_hat) ^ 2)
})

as.list(opt$par)
Briefly, let’s review the code.
We scrape data from Yahoo Finance and transform our date variable into a decimalized year format that we call time. For simplicity, we assign the time and close variables to x and y. Our prm variable is the initial list of model parameters, both of which we set to zero because we don’t know what they should be. The optim function will move around the model parameters, produce new model outputs, and then calculate and minimize the sum of squares. Finally, we show the model parameters that the optim function found for us.
How did our optimizer do? For the model parameter m, we get 0.066395. For parameter b, we get -126.2376.
Our optimizer produced parameters that are very close to the closed-form regression output. Take a look at the model output in the plot above once again; do you see how close we got? We can get them even closer if we feed them back into the optim function (by updating the prm variable).
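That feedback loop takes only a few lines. Here's a sketch, where sse is just the anonymous objective function from above pulled out into a named variable (a name I'm introducing for convenience):

# the same objective as before, given a name
sse <- function(prm) {
  prm <- as.list(prm)
  y_hat <- with(prm, m * x + b)
  sum((log(y) - y_hat) ^ 2)
}

# each pass starts where the previous one left off
for (i in 1:3) {
  prm <- as.list(optim(prm, sse)$par)
}
prm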
Nonlinear Regression
Now for the star of the show, something other than a line!
We’ve established that optimizers do a decent job of replicating simple linear regression, but that’s nothing new. Before we fit a more complex model, let’s take a look at the residuals of our simple model.
Basically, this is the information “left over” from the model that we still need to explain. I don’t know about you, but I see a sine wave in the residuals: generally higher around 1930, the 1960s, 2000, and now; generally lower around the early 1940s, 1980, and 2009.
Can we find a sine wave that fits our residuals?
Generally, a sine wave has an amplitude (A), angular frequency (w), and phase (ϕ, written as o in the code below). Adding it to our line gives the model log(price) = (m · t + b) + A · sin(w · t + ϕ). Let’s add those parameters to the model and get the optimizer to fit them.
prm <- list(m = 0.0664,
            b = -126.3,
            A = 0,
            w = 0,
            o = 0)

opt <- optim(prm, function(prm) {
  prm <- as.list(prm)
  # fitting the model
  y_hat <- with(prm, (m * x + b) + (A * sin(w * x + o)))
  # minimizing the sum of squared deviations
  sum((log(y) - y_hat) ^ 2)
})

as.list(opt$par)
I ran the optim function and fed the model parameters back into prm a couple of times to get these: m = 0.06658, b = -126.6, A = 0.3345, w = 0.1899, and o = -0.593. Magic!
We can now take a look at our nonlinear regression model on a line chart.
Now we’re wiggling!
The fit of our nonlinear regression is a little better than the previous model’s: it has an R² value of 97.98% and an RMSD of 0.2743. We are explaining more of the variation in the data, and we have less total error in the residuals. That’s great.
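You can verify those figures from the fitted parameters yourself. A quick sketch, assuming prm now holds the five values reported above:

# fitted values from the nonlinear model, on the log scale
y_hat <- with(prm, (m * x + b) + (A * sin(w * x + o)))

cor(log(y), y_hat) ^ 2            # R^2
sqrt(mean((log(y) - y_hat) ^ 2))  # RMSD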
This is as far as we’ll journey today; there’s plenty here to wrap your head around. The main takeaway is that by using optimization algorithms, we can fit any equation we can think of in an attempt to explain financial data. Learning to do this for yourself is a seriously powerful skill that applies widely.
Before we wrap up, one final curio.
Market Cycles
What if we were to look only at the sine wave component of our model? Perhaps it could tell us something about how the market cycles over a long timeframe.
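Plotting it is as simple as dropping the linear term from the fitted model. A minimal sketch, again assuming prm holds the fitted parameters:

# the cyclical component alone: A * sin(w * time + o)
cycle <- with(prm, A * sin(w * x + o))
plot(x, cycle, type = "l", xlab = "year", ylab = "cycle (log price units)")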
Interestingly, it generally agrees with the peaks and troughs we identified.
Be sure to take this chart with a grain of salt. The model we fit is quite rudimentary and we should be careful about extrapolating its results.
That being said, the model predicts that our next major cycle top could be around 2030. Previously, my analysis showed that we should reach the $10,000 level on the SPX around that time. Remember that this is from a purely technical context; no fundamentals of investing are considered.
We’ll have to stick around to find out!
Resources
Technical Appendix
The code examples above won’t work unless you have the mysterious yahoo function. For your convenience, all the code necessary to get it working is provided below!
library(dplyr)
library(httr)
library(lubridate)

CGI_YAHOO <- "https://query1.finance.yahoo.com/v7/finance/download/%s?"

# concatenate strings into one with no separator
concat <- function(...) {
  paste(..., sep = "", collapse = "")
}

# sanitize a character vector: snake_case, no leading digits
clean_names_vec <- function(x) {
  x %>%
    gsub("[^0-9A-Za-z]+", "_", .) %>%
    trimws("both", "_") %>%
    gsub("^(\\d)", "x\\1", .) %>%
    tolower()
}

# apply clean_names_vec() to the names of a data frame
clean_names <- function(x) {
  names(x) <- names(x) %>%
    clean_names_vec()
  x
}

# convert a date to seconds since the UNIX epoch
as_unix_time <- function(date, origin = "1970-01-01") {
  date %>%
    as.POSIXct("UTC", origin = origin) %>%
    as.numeric()
}

# collapse a named vector into a query string: "a=1&b=2"
api_kwargs <- function(kwargs) {
  paste(names(kwargs), kwargs, sep = "=", collapse = "&")
}

api_format <- function(cgi, kwargs) {
  concat(cgi, api_kwargs(kwargs))
}

# build the download URL for a symbol
api_yahoo <- function(symbol, interval = "1d", events = "history",
                      date0 = "1900-01-01", date1 = today()) {
  kwargs <- c(period1 = as_unix_time(date0),
              period2 = as_unix_time(date1),
              interval = interval,
              events = events)
  CGI_YAHOO %>%
    sprintf(symbol) %>%
    api_format(kwargs)
}

# download and parse the CSV response, newest rows first
get_yahoo <- function(url) {
  url %>%
    GET() %>%
    content(na = "null") %>%
    clean_names() %>%
    arrange(desc(date))
}

# public entry point: fetch a symbol's history as a tibble
yahoo <- function(symbol, interval = "1d", events = "history",
                  date0 = "1900-01-01", date1 = lubridate::today() + 1) {
  df <- symbol %>%
    api_yahoo(interval, events, date0, date1) %>%
    get_yahoo()
  data.frame(symbol) %>%
    cbind(df) %>%
    as_tibble()
}
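Once all of that is loaded, the yahoo function used in the first code block should just work:

spx <- yahoo("^SPX")
head(spx)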
If you have issues, email me and we can try to troubleshoot your problem.