Techno Blender
Digitally Yours.

Confidence Interval vs Prediction Interval: What is the Difference? | by Aaron Zhu | Dec, 2022


Confidence intervals and prediction intervals are two types of interval estimates used in statistical analysis to quantify the uncertainty associated with an estimate. Both provide a range of values within which an unknown quantity is likely to lie, at a specified level of confidence. They answer different questions, however: a confidence interval targets a population quantity such as a mean, while a prediction interval targets an individual future observation. Understanding this difference is what lets us choose the appropriate interval for a given situation.

Let’s use a simple linear regression model as an example, in which we try to predict the price of a house in Los Angeles (i.e., Y) from its square footage (i.e., X). We can write this linear regression in the following form.

Equation 1: Y = α + βX + ε

But for this topic, let’s rewrite this equation by centering the explanatory variable on its mean (i.e., subtracting the mean from each value of the explanatory variable).

Equation 2: Y = α + β(X - x_bar) + ε

Centering leaves the slope (β) unchanged, because it only shifts the explanatory variable by a constant. The constant term β*x_bar is absorbed into the intercept, so the value of the intercept (α) does change.

Why do we center the explanatory variable on its mean in a linear regression model?

When the explanatory variable is not centered, the intercept term in the model represents the predicted value of the response variable when the explanatory variable is equal to zero. In our example that would be the price of a house with zero square feet, which has no practical meaning.

If the explanatory variable is instead centered on its mean, the intercept term becomes the predicted response at the average value of the explanatory variable, which for OLS equals the mean of the response variable. This makes it more intuitive to interpret.
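The effect of centering can be checked numerically. The following is a minimal sketch using made-up square footage and price figures (they are assumptions for illustration, not data from this article); `fit_line` is a hypothetical helper that fits an ordinary least-squares line.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: square footage and price in $1000s.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [500, 640, 810, 930, 1120]

# Fit once on the raw predictor and once on the centered predictor.
a_raw, b_raw = fit_line(sqft, price)
sqft_centered = [x - mean(sqft) for x in sqft]
a_cen, b_cen = fit_line(sqft_centered, price)

print(f"raw:      intercept={a_raw:.1f}, slope={b_raw:.4f}")
print(f"centered: intercept={a_cen:.1f}, slope={b_cen:.4f}")
print(f"mean price: {mean(price):.1f}")
```

The two slopes agree, while the centered intercept coincides with the mean price.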

What is a Confidence Interval for the Mean Response?

A Confidence Interval is an interval estimate for the average response value at a given set of values of the explanatory variables.

A Confidence Interval pertains to the sampling uncertainty from the OLS estimators, α^ and β^.

α and β are coefficients (or parameters) in the linear regression model. They are usually unknown to us because in many cases it is impossible to collect all data on the population to compute their values. Instead, we can only rely on the sample data to compute OLS estimators, α^ and β^ to estimate α and β. If we collect a different set of data and fit the model again, we will likely get different values of α^ and β^. This uncertainty of α^ and β^ (aka, the sampling uncertainty) is one of the sources of uncertainty for the predicted response value.
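To make the sampling uncertainty concrete, the sketch below repeatedly draws new samples from one assumed population and refits the centered model each time. The true coefficients, the noise level, and the grid of square footages are invented for illustration only.

```python
import random
from statistics import mean, stdev

def fit_centered(x, y):
    """OLS fit of y = a + b*(x - x_bar); returns (a_hat, b_hat)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return y_bar, sxy / sxx

# Assumed "population" model (hypothetical numbers), price in $1000s.
TRUE_ALPHA, TRUE_BETA, NOISE_SD = 800.0, 0.3, 50.0
sqft = list(range(1000, 3001, 100))  # 21 houses, same x values in every sample
x_bar = mean(sqft)

random.seed(0)  # fixed seed so the run is repeatable
alphas, betas = [], []
for _ in range(500):
    # Draw a fresh sample of prices around the true line, then refit.
    price = [TRUE_ALPHA + TRUE_BETA * (x - x_bar) + random.gauss(0, NOISE_SD)
             for x in sqft]
    a_hat, b_hat = fit_centered(sqft, price)
    alphas.append(a_hat)
    betas.append(b_hat)

print(f"alpha^: mean {mean(alphas):.1f}, sd {stdev(alphas):.2f}")
print(f"beta^:  mean {mean(betas):.4f}, sd {stdev(betas):.4f}")
```

Each refit gives slightly different estimates that scatter around the true coefficients; that scatter is the sampling uncertainty a confidence interval is built to capture.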

In the context of prediction, a confidence interval gives us a range of values for the AVERAGE response value for a given set of values of explanatory variables.

For example, if we would like to estimate the average price of houses in Los Angeles with 2,000 square feet, then we need a Confidence Interval.

How to compute a Confidence Interval for the average response value?

We would need to know both the expected value and variance of the average response value (y^) to compute the confidence interval.

The OLS estimates for the linear regression model in Equation 2 are

α^ = y_bar
β^ = Σ(x_i - x_bar)(y_i - y_bar) / Σ(x_i - x_bar)²

The expected value (aka, the point estimate) of the average response value for a given value of the explanatory variable (x) is

y^ = α^ + β^(x - x_bar), with E(y^) = α + β(x - x_bar)

To compute the variance of the average response value, we need the sampling distributions of α^ and β^, and in particular their variances. They are

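Var(α^) = σ² / n
Var(β^) = σ² / Σ(x_i - x_bar)²
Cov(α^, β^) = 0 (a consequence of centering the explanatory variable)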

Then we can compute the variance of the average response value:

Var(y^) = Var(α^) + (x - x_bar)² * Var(β^) = σ² * [1/n + (x - x_bar)² / Σ(x_i - x_bar)²]

Here σ² is the variance of the error term, which is typically unknown. We can estimate its value using the mean square error (MSE or S²) from the sample data.

MSE = S² = Σ(y_i - y^_i)² / (n - 2)

Finally, we can compute the confidence interval for the average response:

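y^ ± t * sqrt( MSE * [1/n + (x - x_bar)² / Σ(x_i - x_bar)²] )

where t is the t-multiplier for the chosen confidence level with n - 2 degrees of freedom.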
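As a rough illustration of these steps, here is a minimal sketch that computes a confidence interval for the mean response in simple linear regression. The data are made up, `mean_response_ci` is a hypothetical helper name, and the t-multiplier is passed in by the caller (3.182 is the two-sided 95% value for 3 degrees of freedom from a t table).

```python
import math
from statistics import mean

def mean_response_ci(x, y, x0, t_mult):
    """Confidence interval for the mean response at x0 (simple linear regression).

    t_mult is the t-multiplier for the desired confidence level with
    n - 2 degrees of freedom, looked up by the caller.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    alpha_hat = y_bar  # intercept of the centered model
    # Mean square error: residual sum of squares divided by n - 2.
    sse = sum((yi - (alpha_hat + beta_hat * (xi - x_bar))) ** 2
              for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    y_hat = alpha_hat + beta_hat * (x0 - x_bar)
    se_mean = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))
    return y_hat - t_mult * se_mean, y_hat + t_mult * se_mean

# Hypothetical data: square footage and price in $1000s.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [500, 640, 810, 930, 1120]

T_95_DF3 = 3.182  # two-sided 95% t-multiplier for n - 2 = 3 degrees of freedom
low, high = mean_response_ci(sqft, price, 2000, T_95_DF3)
print(f"95% CI for the mean price at 2000 sq ft: ({low:.1f}, {high:.1f})")
```

Passing the t-multiplier as an argument keeps the sketch free of third-party dependencies; in practice it would usually come from a statistics library.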

What is a Prediction Interval for a New Response?

A Prediction Interval is an interval estimate for predicting a new response value or a future observation for a given set of values of explanatory variables.

A Prediction Interval is wider than the Confidence Interval because it includes not only the sampling uncertainty from the OLS estimators, α^ and β^, but also the uncertainty from the irreducible error, ε, which is not explained by the linear regression model.

In the context of prediction, a prediction interval gives us a range of values for ANY possible response value for a given set of values of explanatory variables.

For example, if we would like to estimate the price of one randomly chosen house in Los Angeles with 2,000 square feet, then we need a Prediction Interval.

How to compute a Prediction Interval for a New Response?

We would need to know both the expected value and variance of the new response variable (y).

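We know that a new response at x follows the model

y = α + β(x - x_bar) + ε

and that the error term has mean zero and variance σ²:

E(ε) = 0, Var(ε) = σ²

Therefore, the expected value of the new response variable is

E(y) = α + β(x - x_bar)

This is identical to the expected value of the average response, so the point prediction is the same y^. The variance that matters for the interval is that of the prediction error, y - y^. Because the new error ε is independent of the estimators fitted on the sample, the two variances add:

Var(y - y^) = σ² + Var(y^) = σ² * [1 + 1/n + (x - x_bar)² / Σ(x_i - x_bar)²]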

Finally, we can compute the prediction interval for the new response:

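y^ ± t * sqrt( MSE * [1 + 1/n + (x - x_bar)² / Σ(x_i - x_bar)²] )

where MSE estimates σ² and t is the t-multiplier for the chosen confidence level with n - 2 degrees of freedom.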
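The sketch below computes the half-widths of both intervals at the same point so they can be compared side by side. It assumes the same made-up data as before; `interval_half_widths` is a hypothetical helper, and the t-multiplier is supplied by the caller.

```python
import math
from statistics import mean

def interval_half_widths(x, y, x0, t_mult):
    """Return (y_hat, ci_half, pi_half) at x0 for simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    fitted = [y_bar + beta_hat * (xi - x_bar) for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2)
    # Shared term: uncertainty coming from the estimated coefficients.
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    y_hat = y_bar + beta_hat * (x0 - x_bar)
    ci_half = t_mult * math.sqrt(mse * leverage)        # mean response
    pi_half = t_mult * math.sqrt(mse * (1 + leverage))  # single new observation
    return y_hat, ci_half, pi_half

# Hypothetical data: square footage and price in $1000s.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [500, 640, 810, 930, 1120]

T_95_DF3 = 3.182  # two-sided 95% t-multiplier for 3 degrees of freedom
y_hat, ci_half, pi_half = interval_half_widths(sqft, price, 2000, T_95_DF3)
print(f"point estimate:      {y_hat:.1f}")
print(f"confidence interval: +/- {ci_half:.1f}")
print(f"prediction interval: +/- {pi_half:.1f}")
```

Both intervals share the same center; only the extra 1 inside the square root separates them.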

Why is a Prediction Interval wider than a Confidence Interval?

Mathematically, the formulas show that the variance used for the Prediction Interval includes the extra term σ², which accounts for the variance of the error term.

Intuitively, in our example, house prices could vary due to other factors NOT included in the regression model, such as the location, condition of the house, mortgage interest rate, and other unobserved factors. These excluded variables will be absorbed in the error term, ε. The prediction interval would need to account for the uncertainty of these excluded variables. Therefore, the prediction interval has a wider range than the confidence interval for the same value of explanatory variables.
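A small simulation can illustrate the difference in what each interval covers. This is a sketch under assumed population values (the coefficients, noise level, and sample design are invented): it counts how often the 95% confidence interval contains the true mean response, and how often each interval contains a single new house price.

```python
import math
import random
from statistics import mean

TRUE_ALPHA, TRUE_BETA, NOISE_SD = 800.0, 0.3, 50.0  # hypothetical population
sqft = [1000 + 200 * i for i in range(10)]           # n = 10 houses
T_95_DF8 = 2.306                                     # 95% t-multiplier, 8 df
X0 = 2000                                            # square footage of interest

n = len(sqft)
x_bar = mean(sqft)
sxx = sum((x - x_bar) ** 2 for x in sqft)
leverage = 1 / n + (X0 - x_bar) ** 2 / sxx
true_mean = TRUE_ALPHA + TRUE_BETA * (X0 - x_bar)

def draw_price(x):
    """One simulated price at square footage x."""
    return TRUE_ALPHA + TRUE_BETA * (x - x_bar) + random.gauss(0, NOISE_SD)

random.seed(1)  # fixed seed so the run is repeatable
trials = 2000
ci_covers_mean = ci_covers_new = pi_covers_new = 0
for _ in range(trials):
    price = [draw_price(x) for x in sqft]
    y_bar = mean(price)
    beta_hat = sum((x - x_bar) * (p - y_bar) for x, p in zip(sqft, price)) / sxx
    mse = sum((p - (y_bar + beta_hat * (x - x_bar))) ** 2
              for x, p in zip(sqft, price)) / (n - 2)
    y_hat = y_bar + beta_hat * (X0 - x_bar)
    ci_half = T_95_DF8 * math.sqrt(mse * leverage)
    pi_half = T_95_DF8 * math.sqrt(mse * (1 + leverage))
    new_price = draw_price(X0)  # one new house at X0
    ci_covers_mean += abs(true_mean - y_hat) <= ci_half
    ci_covers_new += abs(new_price - y_hat) <= ci_half
    pi_covers_new += abs(new_price - y_hat) <= pi_half

print(f"CI contains the true mean:  {ci_covers_mean / trials:.3f}")
print(f"CI contains a new house:    {ci_covers_new / trials:.3f}")
print(f"PI contains a new house:    {pi_covers_new / trials:.3f}")
```

The confidence interval should reach its nominal coverage for the mean but fall well short for individual houses, while the prediction interval should reach roughly the nominal coverage for individual houses.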

What are the factors that determine the width of the Confidence and Prediction Intervals?

From the formulas, we can see that

  • As the MSE decreases, the intervals become narrower. To get a smaller MSE in a linear regression model, we need an appropriate model specification that includes relevant and meaningful predictors.
  • As the confidence level decreases, the t-multiplier decreases, and the intervals become narrower.
  • As the sample size increases, the intervals become narrower. The prediction interval, however, cannot shrink below the width implied by σ², because the irreducible error does not depend on the sample size.
  • The higher the variance of the predictors, the narrower the intervals. Intuitively, the more information the predictors provide to the model, the more precise the interval estimates.
  • The closer the predictor values are to their means, the narrower the intervals. Intuitively, the linear regression model predicts most precisely when the predictors are near their means. Therefore, we would expect the interval bands to have an “hourglass” shape: narrowest at the mean and wider toward both ends.
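These effects can be explored by plugging numbers into the width formulas. The sketch below uses invented summary statistics (MSE, sample size, and spread of the predictor are assumptions, not results from real data) and a hypothetical helper `half_width`; the t-multipliers 2.101 and 1.734 are the two-sided 95% and 90% table values for 18 degrees of freedom.

```python
import math

def half_width(mse, n, sxx, x0, x_bar, t_mult, new_observation=False):
    """Half-width of the interval at x0 in simple linear regression.

    sxx is the sum of squared deviations of the predictor from its mean.
    Set new_observation=True for a prediction interval.
    """
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    extra = 1.0 if new_observation else 0.0
    return t_mult * math.sqrt(mse * (extra + leverage))

# Hypothetical summary statistics (not taken from real data).
base = dict(mse=300.0, n=20, sxx=5_000_000.0, x_bar=2000.0, t_mult=2.101)

# Widths across the range of x: narrowest at the mean, wider at both ends.
print("x0     CI half-width   PI half-width")
for x0 in (1000, 1500, 2000, 2500, 3000):
    ci = half_width(x0=x0, **base)
    pi = half_width(x0=x0, new_observation=True, **base)
    print(f"{x0:<6} {ci:>13.2f}   {pi:>13.2f}")

# Change one factor at a time, evaluated away from the mean (x0 = 3000).
reference = half_width(x0=3000, **base)
smaller_mse = half_width(x0=3000, **{**base, "mse": 150.0})
lower_confidence = half_width(x0=3000, **{**base, "t_mult": 1.734})  # 90% level
more_spread = half_width(x0=3000, **{**base, "sxx": 10_000_000.0})
print(f"reference {reference:.2f}, smaller MSE {smaller_mse:.2f}, "
      f"90% level {lower_confidence:.2f}, more spread in x {more_spread:.2f}")
```

Holding the t-multiplier fixed while varying other inputs is a simplification: in a real analysis the multiplier also changes with the sample size through its degrees of freedom.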

How to compute the Confidence Interval and Prediction Interval in a Multiple Linear Regression (MLR) model

Usually, we will deal with a linear regression model with multiple predictors. The confidence interval and prediction interval for MLR are very similar to simple linear regression.

The general formula for Confidence Interval in MLR is

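y^_h ± t * sqrt( MSE * x_hᵀ (XᵀX)⁻¹ x_h )

where X is the n × p design matrix (including the column of ones for the intercept), x_h is the vector of predictor values of interest, MSE = SSE / (n - p), and t is the t-multiplier with n - p degrees of freedom.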

The general formula for Prediction Interval in MLR is

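y^_h ± t * sqrt( MSE * [1 + x_hᵀ (XᵀX)⁻¹ x_h] )

(X is the design matrix with an intercept column, x_h is the vector of predictor values of interest, and t has n - p degrees of freedom.)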
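The matrix form can be sketched with NumPy. This is an illustrative example only: the second predictor (number of bedrooms) and all data values are made up, `mlr_intervals` is a hypothetical helper, and the t-multiplier is supplied by the caller.

```python
import numpy as np

def mlr_intervals(X, y, x_h, t_mult):
    """Return (y_hat, ci_half, pi_half) at predictor vector x_h.

    X must already contain a leading column of ones for the intercept, and
    x_h must start with 1. t_mult is the t-multiplier with n - p degrees
    of freedom.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_h = np.asarray(x_h, dtype=float)
    n, p = X.shape
    xtx = X.T @ X
    coef = np.linalg.solve(xtx, X.T @ y)  # OLS coefficients
    resid = y - X @ coef
    mse = resid @ resid / (n - p)
    # x_h' (X'X)^-1 x_h, computed without forming the inverse explicitly.
    leverage = x_h @ np.linalg.solve(xtx, x_h)
    y_hat = x_h @ coef
    ci_half = t_mult * np.sqrt(mse * leverage)
    pi_half = t_mult * np.sqrt(mse * (1 + leverage))
    return float(y_hat), float(ci_half), float(pi_half)

# Hypothetical data: square footage (in 1000s), bedrooms, price in $1000s.
sqft_k = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
beds = [2, 3, 3, 4, 4, 5]
price = [500, 640, 810, 930, 1120, 1250]
X = np.column_stack([np.ones(len(sqft_k)), sqft_k, beds])

T_95_DF3 = 3.182  # two-sided 95% t-multiplier for n - p = 6 - 3 = 3 df
y_hat, ci_half, pi_half = mlr_intervals(X, price, [1, 2.0, 3], T_95_DF3)
print(f"prediction {y_hat:.1f}, CI +/- {ci_half:.1f}, PI +/- {pi_half:.1f}")
```

With a single predictor, the term x_hᵀ (XᵀX)⁻¹ x_h reduces to 1/n + (x - x_bar)² / Σ(x_i - x_bar)², so this sketch reproduces the simple linear regression intervals.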

Summary

Confidence intervals and prediction intervals are both interval estimates that provide a range of values within which an unknown quantity is likely to lie, with a specified level of confidence. However, a confidence interval estimates a population quantity, such as the mean response at given predictor values, while a prediction interval predicts the value of an individual future observation. At the same predictor values and confidence level, the confidence interval is narrower than the prediction interval, because it only includes the uncertainty associated with estimating the model coefficients, while the prediction interval also includes the uncertainty of the individual error term. It is important to choose the interval that matches the statistical question being asked.



