Iterated Reweighted Least Squares and GLMs Explained
by Gerry Christian Ongko, July 2022


With a detailed implementation in Python

Image by Tine Ivanič on Unsplash

Generalised Linear Models (GLMs) are regression models that generalise the linearity assumption of the usual linear regression model. Because of this non-linearity, estimating the regression parameters is not as simple as estimating linear regression parameters.

The Iterated Reweighted Least Squares (IRLS) algorithm, sometimes also called Iterated Weighted Least Squares (IWLS), is a method for finding the maximum likelihood estimates of generalised linear models. It is an extension of the weighted least squares method. Let’s start with a short background introduction.

In a linear model, we can estimate the parameters of the regression using the normal equations,

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

Ordinary least squares estimate

This method assumes errors with zero mean and constant variance.

$$\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$$

Linear error distribution

If the relationship between the predictors and the response is not linear, we will obtain errors with non-constant variance if we insist on using the above normal equation.

$$\varepsilon \sim \mathcal{N}(0, \Sigma), \qquad \Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$$

Non-linear error distribution

This result is not desirable: we want constant error variance for better predictability. With some mathematical manipulation that I will not show, we can transform the error distribution into a standard multivariate normal using the weighted least squares method.

$$\hat{\beta} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y$$

Weighted least squares estimate
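As a quick aside (this snippet is mine, not from the original article), the weighted least squares estimate is straightforward to compute in numpy when the error variances are known:

import numpy as np

def wls(X, y, sigma2):
    # Weighted least squares: solve (X' Σ⁻¹ X) β = X' Σ⁻¹ y.
    # sigma2 holds the error variances, so Σ⁻¹ is diagonal and we can
    # scale each row of X by its weight instead of forming a full matrix.
    w = 1.0 / sigma2
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)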

We still cannot use this for GLMs, for a specific reason: the y-variable of a GLM is different from the predicted variable. Consider the data below (which we will also use for the code).

Here we see that the y-variable is a count variable indicating the number of successes. For our case we cap the number of successes at one, making this a logistic regression problem. Binomial (and logistic) regression does not predict the count of successes for an instance but the probability of success. This difference between the y-variable and the desired predicted variable is problematic because we cannot put our y-variable into the weighted least squares equation. To overcome this issue, we introduce the IRLS algorithm. But before we jump to the algorithm, I need to explain the basics of the GLM.
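The original data table is not reproduced in this copy of the article. From the text and the R output later, it has ten rows with two predictors, Height and Weight, and a binary Success column. A purely hypothetical stand-in with the same shape (the actual values are not recoverable) could be built like this:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's data: only the shape and
# column names are known, not the real values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Height": rng.normal(1.7, 0.1, 10),
    "Weight": rng.normal(70.0, 10.0, 10),
    "Success": rng.integers(0, 2, 10),
})

# Design matrix with an intercept column (required by IWLS below)
# and the binary response vector.
X = np.column_stack([np.ones(len(df)), df["Height"], df["Weight"]])
y = df["Success"].to_numpy()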

As the name suggests, GLMs are a generalisation of linear regression where the predicted variable is related to the linear model through a link function, denoted by the letter g.

$$g(\mu) = X\beta, \qquad \mu = \mathbb{E}[y \mid X]$$

GLMs

The link function is handy if your predicted variable has a restricted domain. An example is if you want to model a probability as your response variable. As you know, probabilities can only take values between 0 and 1. Ordinary linear regression cannot satisfy this domain limitation, hence we introduce the link function. The form below is usually easier to comprehend.

$$\mu = g^{-1}(X\beta)$$

GLMs

Here, μ is the probability to be predicted. But you can model any other variable with any other domain limitation through an appropriate choice of link function.

Note. The link function has to be differentiable (and invertible, so that we can recover μ).

Let’s talk about the link function. For our case, we need to find a function whose inverse transforms an infinite domain into a [0, 1] domain. Many functions have this property. I will show you the most popular link function for this transformation.

$$g(\mu) = \log\!\left(\frac{\mu}{1-\mu}\right)$$

Logit function

The above function is famously known as the logit function. The logit link function works for binary y-variables to predict probabilities.
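Inverting the logit gives the sigmoid function, which maps the whole real line back into [0, 1]; this is the ig function that appears in the prediction code later:

$$\mu = g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}}, \qquad \eta = X\beta$$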

Note. GLMs are only valid if the y-variable is from an exponential family distribution.

Logistic regression has a binary y-variable with values of either 1 or 0. We can say that this data comes from a Bernoulli distribution with a non-constant probability parameter; this probability is what we are trying to model. Because the Bernoulli (or more generally, Binomial) distribution is from the exponential family, we can model this through a GLM. But what is this exponential family?

A distribution comes from the exponential family if its density or mass function can be represented in the following form.

$$f(y; \theta, \phi) = \exp\!\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)$$

Exponential family

I will show how to express the Binomial distribution in this form.

$$\begin{aligned} f(y) &= \binom{n}{y} p^y (1-p)^{n-y} \\ &= \exp\!\left( y \log\frac{p}{1-p} + n\log(1-p) + \log\binom{n}{y} \right) \end{aligned}$$

so that

$$\theta = \log\frac{p}{1-p}, \quad b(\theta) = n\log(1+e^{\theta}), \quad a(\phi) = 1, \quad c(y,\phi) = \log\binom{n}{y}$$

Showing that the Binomial distribution is part of the exponential family

There are other ways to represent the parameters of the exponential family, as they are non-unique. I went to all this length to explain the exponential family because we require the concept of the variance function, which comes from one of these parameters.

$$V(\mu) = b''(\theta)$$

Variance function

I understand this is rather tedious to find, so I will offer a more straightforward way to derive the variance function. In essence, the variance function represents the variance of a distribution in terms of its mean.

$$\operatorname{Var}(y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu)$$

Variance definition

The mean of a Binomial distribution is np, while the variance is np(1-p). Since we know that a(ϕ) = 1 from the derivation above, I can represent my variance function as follows.

$$V(\mu) = \mu\left(1 - \frac{\mu}{n}\right)$$

Variance function of the Binomial distribution

If we substitute μ with np, we obtain the variance np(1-p). If there is no direct link between the mean and the variance, as in the case of the Normal distribution, you will have to find the variance function manually using the b''(θ) formula.
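As a sanity check on the shortcut, differentiating the b(θ) we found for the Binomial recovers both the mean and this variance function:

$$b'(\theta) = \frac{n e^{\theta}}{1 + e^{\theta}} = np = \mu, \qquad V(\mu) = b''(\theta) = \frac{n e^{\theta}}{(1 + e^{\theta})^2} = np(1-p) = \mu\left(1 - \frac{\mu}{n}\right)$$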

We now have everything we need for IRLS. We start by defining a ‘new’ regression problem,

$$z = g(\mu) + (y - \mu)\, g'(\mu) = X\beta + \varepsilon$$

Equation 1. New regression

We can estimate the parameters if we know the variance of the error. The variance of the error can be obtained by the following formula.

$$\operatorname{Var}(\varepsilon_i) = [g'(\mu_i)]^2 \operatorname{Var}(y_i) = [g'(\mu_i)]^2\, a(\phi)\, V(\mu_i), \qquad \Sigma = \operatorname{diag}\big(\operatorname{Var}(\varepsilon_1), \ldots, \operatorname{Var}(\varepsilon_n)\big)$$

Equation 2. Variance of errors

Note that for GLMs, the a(ϕ) term can be ignored entirely because it will cancel out in the following calculation.

From here, we can estimate the parameters using a weighted least squares solution. This is where the a(ϕ) cancels out: because Σ⁻¹ appears both inside the inverted bracket and outside it, any constant multiplier of Σ divides out.

$$\hat{\beta} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} z$$

Equation 3. Weighted least squares

Using the current parameter estimate, we find the new value of μ,

$$\mu = g^{-1}(X\hat{\beta})$$

Equation 4. Find μ

Recall that z and Σ are functions of μ, μ is a function of β, and β is a function of z and Σ. We estimate the parameters by iterating over this recursive relationship.

The algorithm stops if the increase in log-likelihood is no longer significant.

$$\ell(\beta) = \sum_{i=1}^{n} \big[\, y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \,\big]$$

Equation 5. Log likelihood of logistic regression.

Note that this log-likelihood equation is only valid for logistic regression. For other y-distributions, the formula changes accordingly.

Let’s compile. The Iterated Reweighted Least Squares algorithm:

  1. Initialise μ within the defined domain. I will initialise with an array of 0.5 probabilities.
  2. Given the current value of μ, calculate z and Σ using equation 1 and equation 2.
  3. Given the current value of z and Σ, calculate β using the weighted least squares formula; equation 3.
  4. Given the current value of β, calculate μ using equation 4.
  5. If the increase in log-likelihood (equation 5) is smaller than a prespecified epsilon, stop. Otherwise, return to step 2.

Let’s gather all the puzzle pieces together. Firstly, the link function.
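The original code embed is not shown in this copy of the article, so here is a minimal sketch of what this piece could look like: the logit link g, its inverse ig (the name used in the prediction code later), and the derivative of the link needed for the working response.

import numpy as np

def g(mu):
    # Logit link: maps probabilities in (0, 1) to the real line.
    return np.log(mu / (1 - mu))

def ig(eta):
    # Inverse link (sigmoid): maps the real line back to (0, 1).
    return 1 / (1 + np.exp(-eta))

def g_prime(mu):
    # Derivative of the logit link, dg/dμ = 1 / (μ(1 − μ)).
    return 1 / (mu * (1 - mu))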

Next, our algorithm pieces,
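Again sketched rather than copied from the missing embed, these pieces implement equations 1, 2, and 5 for our Bernoulli case, where a(ϕ) = 1 and V(μ) = μ(1 − μ); they build on the link functions above.

def working_response(y, mu):
    # Equation 1: z = g(μ) + (y − μ) g'(μ).
    return g(mu) + (y - mu) * g_prime(mu)

def error_variance(mu):
    # Equation 2 diagonal: Var(ε_i) = g'(μ_i)² · V(μ_i), with a(ϕ) = 1.
    return g_prime(mu) ** 2 * mu * (1 - mu)

def log_likelihood(y, mu):
    # Equation 5: Bernoulli log-likelihood.
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))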

Here is my code for the algorithm. The sections of the code match the steps outlined above; please try to match each part with the corresponding equation.
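Since the original embed is also missing here, the following is a reconstruction of the loop from the five steps above, under the same assumptions; the signature and the printed message match the usage and output shown below.

def IWLS(X, y, epsilon=1e-8, max_iter=100):
    # Step 1: initialise μ within (0, 1).
    mu = np.full(len(y), 0.5)
    ll_old = log_likelihood(y, mu)

    for i in range(1, max_iter + 1):
        # Step 2: working response and error variances (equations 1 and 2).
        z = working_response(y, mu)
        w = 1 / error_variance(mu)          # diagonal of Σ⁻¹

        # Step 3: weighted least squares solve (equation 3).
        Xw = X * w[:, None]
        b = np.linalg.solve(Xw.T @ X, Xw.T @ z)

        # Step 4: update μ through the inverse link (equation 4).
        mu = ig(X @ b)

        # Step 5: stop once the log-likelihood gain falls below epsilon.
        ll_new = log_likelihood(y, mu)
        if ll_new - ll_old < epsilon:
            break
        ll_old = ll_new

    print(f"IWLS completed on {i} iterations")
    return b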

Note that the X parameter requires an intercept column (a column of ones). Running the following line of code,

b = IWLS(X, y)

would assign the parameter estimates to the b variable:

IWLS completed on 8 iterations
array([1.10999575, 9.12479957, 2.18745957])

We can use these estimates to predict the probability and the success outcome.

phat = ig(X @ b)
yhat = np.where(phat >= 0.5, 1, 0)

Here phat stores the probability predictions, while yhat stores the binary success predictions.

As far as I know, the Python scikit-learn library has no IRLS solver, so we can’t compare our results to theirs. But luckily, R’s built-in glm function does use IRLS (Fisher scoring) as its solver. Let’s compare.

Running

model <- glm(Success ~ ., family = binomial, data = df)
summary(model)

would return,

Call:
glm(formula = Success ~ ., family = binomial, data = df)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.72517  -0.12458  -0.00007   0.48620   1.07683

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.110      1.283   0.865    0.387
Height         9.125      8.489   1.075    0.282
Weight         2.187      2.099   1.042    0.297

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.8629  on 9  degrees of freedom
Residual deviance:  5.6807  on 7  degrees of freedom
AIC: 11.681

Number of Fisher Scoring iterations: 8

Recall our previous output,

IWLS completed on 8 iterations
array([1.10999575, 9.12479957, 2.18745957])

The parameter estimates and the iteration count match exactly. I do not know the exact stopping condition R uses, but our methods clearly conform.

You have successfully finished this article. See my complete code on GitHub for how to properly preprocess the data and use these functions in practice. Please share any feedback and thoughts in the comments. Thank you for reading!


With a detailed implementation in Python

Image By Tine Ivanič on Unsplash

Generalised Linear Models (GLM) are regression models where we generalise the linear assumption of the usual linear regression model. Because of this non-linearity, estimating the regression parameter will not be as simple as estimating a linear regression parameter.

The Iterated Reweighted Least Squares (IRLS) algorithm or sometimes also Iterated Weighted Least Squares (IWLS), is a method to find the maximum likelihood estimates of generalised linear models. It is an extension of the weighted least squares method. Let’s start with a short background introduction.

In a linear model, we can estimate the parameter of the regression using the normal equations,

Ordinary least squares estimate

This method will have errors with means of zero and constant variance.

Linear error distribution

If the relationship between the predictors and the predicted is not linear, we will obtain errors with an unconstant variance if we insist on using the above normal equation.

Non-linear error distribution

This result is not desirable. We want to have constant error variance for better predictability. With some mathematical manipulation that I will not show, we can turn the error distribution to a standard multivariate normal using the weighted least squares method.

Weighted least squares estimate

We still cannot use this for GLMs for a specific reason. The y-variable of a GLM is different from the predicted variable. Consider the data below (which we will also use for the code).

Here we see that the y-variable is a count variable indicating the number of successes. We keep the maximum number of successes to one for our case, making this a logistic regression problem. Binomial (and logistic) regression does not predict the count of success of an instance but the probability of success. This difference between the y-variable and the desired predicted variable is problematic because we cannot put our y-variable to the weighted least squares equation. To overcome this issue, we introduce the IRLS algorithm. But before we jump to the algorithm, I need to explain the basics of the GLM.

As the name suggests, GLMs are a generalisation of the linear regression where the predicted variable is related to the linear model through a link function denoted by the letter g.

GLMs

The link function is handy if your predicted variable is of a finite domain. An example is if you want to model probability as your response variable. As you know, the value of probabilities can only be between 0 and 1. Ordinary linear regression could not satisfy this domain limitation. Hence we introduce the link function. The form below is usually easier to comprehend.

GLMs

Here, the μ is the probability to be predicted. But you can use this to model any other variable with any other domain limitation with the appropriate selection of the link function.

Note. The link function has to be differentiable.

Let’s talk about the link function. For our case, we need to find a function that transforms an infinite domain into a [0, 1] domain. Many functions have this property. I will show you the most popular link function for this transformation.

Logit function

The above function is famously known as the logit function. The logit link function works for binary y-variables to predict probabilities.

Note. GLMs are only valid if the y-variable is from an exponential family distribution.

The logistic regression has a binary y-variable with values of either 1 or 0. We can say that this data comes from a Bernoulli distribution with a non-constant probability parameter. This probability is what we are trying to model. Because the Bernoulli or Binomial distribution is from the exponential family, we can model this through a GLM. But what is this exponential family?

A distribution comes from the exponential family if their density or mass function can be represented by the following form.

Exponential family

I will show how to express the Binomial distribution in this form.

Proofing Binomial distribution as a part of exponential family

There are other ways to represent the parameters of the exponential family as they are non-unique. I went through all this length to explain the exponential family because we require the concept of variance function from one of the parameters.

Variance function

I understand this is rather tedious to find, so I will offer a more straightforward way to derive the variance function. In essence, the variance function is a way to represent the variance of a distribution in terms of its mean.

Variance definition

The mean of a Binomial distribution is np while the variance is np(1-p). Since we know thata(ϕ) = 1 from the derivation above, I can represent my variance function as follows.

Variance function of binomial distribution

If we substitute the μ with np, we will obtain the variance which is np(1-p). Suppose there is no direct link between the mean and variance like in the example of the Normal distribution. In that case, you will have to find the variance function manually using the formula.

We start by defining a ‘new’ regression problem,

Equation 1. New regression

We can estimate the parameters if we know the variance of the error. The variance of the error can be obtained by the following formula.

Equation 2. Variance of errors

Note that for GLMs, the a(ϕ) term can be ignored entirely because it will cancel out in the following calculation.

From here, we can estimate the parameters using a weighted least squares solution. This is where the a(ϕ) cancels out.

Equation 3. Weighted least squares

Using the current parameter estimate, we find the new value of μ,

Equation 4. Find μ

Recall that we defined z and Σare functions of μ, μ is a function of β, and β is a function of z and Σ. We estimate the parameters by iterating over this recursive relationship.

The algorithm stops if the increase in log-likelihood is no longer significant.

Equation 5. Log likelihood of logistic regression.

Note that this log-likelihood equation is only valid for the logistic regression. For other distributions, the formula will depend on the chosen link function.

Let’s compile. The Iterated Reweighted Least Squares algorithm:

  1. Initialise μ within the defined domain. I will initialise with an array of 0.5probabilities.
  2. Given the current value of μ, calculate z and Σ using equation 1 and equation 2.
  3. Given the current value of z and Σ, calculate β using the weighted least squares formula; equation 3.
  4. Given the current value of β, calculate μ using equation 4.
  5. If the increase in log-likelihood (equation 5) is smaller than a prespecified epsilon, stop. Otherwise, return to step 2.

Let’s gather all the puzzle pieces together. Firstly, the link function.

Next, our algorithm pieces,

Here is my code for the algorithm. I match the sections with the steps outlined above. Please try to match each equation above.

Note that the X in the parameter requires an intercept column. Running the following line of code,

b = IWLS(X, y)

Would return the parameter estimates to the b variable.

IWLS completed on 8 iterations
array([1.10999575, 9.12479957, 2.18745957])

We could use this information to predict the probability and success.

phat = ig(X @ b)
yhat = np.where(phat >= 0.5, 1, 0)

Here phat stores the probability predictions while yhat stores the binomial prediction.

As far as my knowledge, the Python scikit-learn library has no IRLS solver, so we can’t compare our results to theirs. But luckily, the R faraway package does use IRLS as their GLM solver. Let’s compare.

Running

summary(model)

Would return,

Call:
glm(formula = Success ~ ., family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.72517 -0.12458 -0.00007 0.48620 1.07683
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.110 1.283 0.865 0.387
Height 9.125 8.489 1.075 0.282
Weight 2.187 2.099 1.042 0.297
(Dispersion parameter for binomial family taken to be 1)Null deviance: 13.8629 on 9 degrees of freedom
Residual deviance: 5.6807 on 7 degrees of freedom
AIC: 11.681
Number of Fisher Scoring iterations: 8

Recall our previous output,

IWLS completed on 8 iterations
array([1.10999575, 9.12479957, 2.18745957])

This is the exact parameter estimate and iteration number. I do not know the actual stopping condition that they are using, but it seems our methods conform pretty well.

You have successfully finished this article. See my complete code on GitHub to see how to properly preprocess the data and use our functions for practical usage. Please do share any feedback and thoughts in the comments. Thank you for reading!

FOLLOW US ON GOOGLE NEWS

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – admin@technoblender.com. The content will be deleted within 24 hours.
Ai NewsChristianexplainedGerryGLMsIteratedJulOngkoReweightedSquaresTech NewsTechnology
Comments (0)
Add Comment