
Stat Stories: Delta Method in Statistics | by Rahul Bhadani | Nov, 2022



A commonly overlooked topic by machine learning practitioners

Cover photo generated by the author using an AI tool Dreamstudio.

Data sampling is at the core of data science. From a given population with pdf or pmf f(x), we sample data points; collectively, these data points are called a random sample and denoted by a random variable X. But since data science is a game of probability, we often repeat the experiment many times. In such a scenario, we end up with n random samples X₁, X₂, … Xₙ (not to be confused with the number of data points in a single sample). Often these random samples are independent and identically distributed with pdf or pmf f(x); hence they are called independent and identically distributed, or iid, random variables.

In this article, we talk about the Delta method, which provides a mathematical framework for calculating the limiting distribution and asymptotic variance of functions of iid samples. The Delta method lets you calculate the variance of a function of a random variable whose variance is known (via a transformation, as we will see later). This framework is closely related to the variable transformation method in statistics that I have previously discussed in detail.

Given iid random samples X₁, X₂, … Xₙ, their joint pdf is given by

f(x₁, x₂, …, xₙ) = ∏ᵢ₌₁ⁿ f(xᵢ)

Equation 1: Joint PDF of iid random variables

As a special case, if all the iid samples (we drop ‘random’ from here on, but assume it implicitly) are normally distributed with mean 0 and variance 1, then X² ~ χ²₁, i.e., a chi-square distribution with one degree of freedom. (This can be tested with a simple script in Python, R, or Julia; a quick R check follows.)
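As a quick empirical check, here is a minimal R sketch (all code examples in this article use R; the seed and sample size are arbitrary choices) comparing quantiles of squared standard normal draws to those of χ²₁:

set.seed(42)                        # for reproducibility
x <- rnorm(1e5)                     # standard normal draws: mean 0, variance 1
x_sq <- x^2                         # squares should follow chi-square with df = 1

probs <- c(0.25, 0.5, 0.75, 0.95)
round(quantile(x_sq, probs), 3)     # empirical quantiles of the squared draws
round(qchisq(probs, df = 1), 3)     # theoretical chi-square(1) quantiles

The two sets of quantiles should agree closely, supporting X² ~ χ²₁.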

Convergence

Convergence tells us how Xₙ behaves relative to some limiting random variable or distribution as n → ∞. We can talk about convergence at various levels:

  1. Convergence in probability: A sequence of random variables X₁, X₂, … Xₙ →ₚ X if for every ε > 0,

lim_{n→∞} P(|Xₙ − X| ≥ ε) = 0

Equation 2. Convergence in probability

where →ₚ denotes convergence in probability. One use of convergence in probability is the weak law of large numbers: for iid X₁, X₂, … Xₙ with 𝔼(X) = μ and var(X) < ∞, (X₁ + X₂ + … + Xₙ)/n →ₚ μ (see the R sketch after this list).

2. Almost Sure Convergence: We say that Xₙ → X a.s. (almost surely) if

P(lim_{n→∞} Xₙ = X) = 1

Equation 3. Almost sure convergence.

Almost sure convergence implies convergence in probability, but the converse is not true. The strong law of large numbers is a result of almost sure convergence: for iid X₁, X₂, … Xₙ with 𝔼(X) = μ and var(X) = σ² < ∞, (X₁ + X₂ + … + Xₙ)/n → μ a.s.

3. Convergence in Distribution: We say Xₙ → X in distribution if the sequence of distribution functions F_{Xₙ} of Xₙ converges to that of X in the appropriate sense: F_{Xₙ}(x) → F_X(x) at every x where F_X is continuous. (Note that I use LaTeX-style notation, since Medium cannot render complicated equations.)

Convergence in distribution is a property of the distributions rather than of the particular random variables, which distinguishes it from the previous two modes of convergence. Convergence in the moment generating function implies convergence in distribution, i.e., if M_{Xₙ}(t) → M_X(t) for all t in a neighborhood of 0, then Xₙ → X in distribution.
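To illustrate the weak law of large numbers from item 1, here is a small R sketch (the exponential distribution and sample sizes are arbitrary choices) showing the sample mean settling near the true mean as n grows:

set.seed(7)
mu <- 2                                   # true mean of Exponential(rate = 1/2)
x <- rexp(1e5, rate = 1/mu)               # iid draws with E(X) = mu
n_grid <- c(10, 100, 1000, 10000, 100000)
round(sapply(n_grid, function(n) mean(x[1:n])), 3)  # values approach mu = 2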

The Central Limit Theorem is one application of convergence in distribution: for iid X₁, X₂, … Xₙ with mean μ and finite variance σ²,

√n (X̄ₙ − μ) → N(0, σ²) in distribution, where X̄ₙ = (X₁ + X₂ + … + Xₙ)/n.

Equation 4. Normal distribution through the Central Limit Theorem, a consequence of convergence in distribution.
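A minimal R check of the Central Limit Theorem (the uniform distribution and the simulation sizes are arbitrary choices): standardized sample means should look standard normal.

set.seed(123)
n <- 1000; reps <- 10000
mu <- 0.5; sigma <- sqrt(1/12)            # mean and sd of Uniform(0, 1)
z <- replicate(reps, sqrt(n) * (mean(runif(n)) - mu) / sigma)
c(mean = mean(z), sd = sd(z))             # close to 0 and 1
round(quantile(z, c(0.025, 0.975)), 2)    # close to -1.96 and 1.96, as for N(0, 1)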

Another consequence of convergence in distribution is Slutsky's theorem:

If Xₙ → X in distribution and Yₙ → c in distribution, where c is a constant, then Xₙ + Yₙ → X + c, XₙYₙ → cX, and Xₙ/Yₙ → X/c (for c ≠ 0), all in distribution.
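As a sketch of Slutsky's theorem in action (the distribution and simulation sizes are my own arbitrary choices): replacing the true σ by the sample standard deviation, which converges in probability to σ, leaves the limiting N(0, 1) distribution unchanged, which is why the t-statistic is asymptotically normal.

set.seed(99)
n <- 5000; reps <- 10000
t_stat <- replicate(reps, {
  x <- rexp(n, rate = 1)                  # mean 1, sd 1
  sqrt(n) * (mean(x) - 1) / sd(x)         # sd(x) -> 1 in probability, so Slutsky applies
})
c(mean = mean(t_stat), sd = sd(t_stat))   # approximately 0 and 1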

The Delta method, through these convergence properties and the Taylor series, approximates the asymptotic behavior of functions of a random variable. Through variable transformation methods, it is easy to see that if Xₙ is asymptotically normal, then any smooth function g(Xₙ) is also asymptotically normal. The Delta method can be used in such situations to calculate the asymptotic distribution of functions of the sample average.

If the variance is small, then Xₙ is concentrated near its mean μ. Thus, what matters for g(Xₙ) is the behavior of g near μ. Hence we can expand g(x) around μ using the Taylor series as follows:

g(x) ≈ g(μ) + g′(μ)(x − μ) + (1/2) g″(μ)(x − μ)² + …

Equation 5. Taylor series approximation of a function of a random variable.
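As a quick numeric sanity check (the function g and the point μ are arbitrary choices), the first-order Taylor approximation of g(x) = log(x) around μ = 2 is very accurate for x near μ:

g <- function(x) log(x)
mu <- 2
taylor1 <- function(x) g(mu) + (1 / mu) * (x - mu)  # g'(x) = 1/x, so g'(mu) = 1/2
x <- c(1.9, 2.0, 2.1)
rbind(exact = g(x), approx = taylor1(x))            # rows nearly identical near mu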

This leads to the following asymptotic result, called the First Order Delta Method:

First Order Delta Method

Let Xₙ be a sequence of random variables satisfying √n(Xₙ − μ) → N(0, σ²) in distribution. If g′(μ) ≠ 0, then

√n (g(Xₙ) − g(μ)) → N(0, σ² [g′(μ)]²) in distribution,

Equation 6. First Order Delta Method

which can be derived from the Taylor expansion above together with the Slutsky theorem mentioned earlier.
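Here is a minimal R simulation of the first-order delta method under assumed parameters (g, μ, σ, and the simulation sizes are all arbitrary choices): with g(x) = x² and Xₙ the mean of n draws from N(μ, σ²), the variance of g(Xₙ) should be approximately σ²[g′(μ)]²/n.

set.seed(1)
n <- 2000; reps <- 10000
mu <- 3; sigma <- 2
g_of_mean <- replicate(reps, mean(rnorm(n, mu, sigma))^2)  # g(x) = x^2
var(g_of_mean)                            # empirical variance of g(X_n)
(sigma^2 * (2 * mu)^2) / n                # delta method prediction: 0.072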

Second Order Delta Method

If we add one more term to the Taylor series in Equation 5, we obtain the second-order delta method, which is useful when g′(μ) = 0 but g″(μ) ≠ 0:

n (g(Xₙ) − g(μ)) → σ² (g″(μ)/2) χ²₁ in distribution,

Equation 7. Second Order Delta Method.

where χ²₁ is the chi-square distribution with one degree of freedom, introduced earlier.
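A sketch of the second-order case under assumed parameters (again arbitrary choices): with μ = 0 and g(x) = x², we have g′(μ) = 0 and g″(μ) = 2, so n·g(Xₙ) should behave like σ²χ²₁, which has mean σ² and variance 2σ⁴.

set.seed(2)
n <- 2000; reps <- 10000
sigma <- 1.5
stat <- replicate(reps, n * mean(rnorm(n, 0, sigma))^2)  # n * g(X_n) with g(x) = x^2
c(mean = mean(stat), var = var(stat))     # close to sigma^2 = 2.25 and 2 * sigma^4 = 10.125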

Let’s do a little coding.

Consider a random normal sample with a mean of 1.5 and a true variance of 0.25 (standard deviation 0.5). We are interested in approximating the variance of this sample multiplied by a constant c = 2.50, i.e., the transformation g(x) = cx with g′(x) = c. Using the Delta method, the transformed sample's variance should be 0.25 × (2.50²) = 1.5625. Let's check this empirically with R code (the sample itself is generated here, with an arbitrary seed and sample size):

set.seed(2022)                                # for reproducibility
sample <- rnorm(10000, mean = 1.5, sd = 0.5)  # true variance 0.5^2 = 0.25
c <- 2.50
trans_sample <- c * sample                    # linear transformation g(x) = c * x
var(trans_sample)                             # empirical variance of the transformed sample

whose output is approximately 1.56 (the exact value depends on the random draws), pretty close to the 1.5625 obtained using the Delta method.

In this article, I covered the Delta method, an important topic for students taking statistics classes that is generally overlooked by data science and machine learning practitioners. Delta methods are used in applications such as the variance of a product of survival probabilities, the variance of an estimate of a reporting rate, the joint estimation of the variance of a parameter and its covariance with another parameter, and model averaging, to name a few. I suggest readers look at reference materials to gain a deeper understanding of this topic.


