
Maximum Likelihood Estimation for Beginners (with R code)

by Jae Kim, March 2023



The maximum likelihood principle is a fundamental method of estimation for a large number of models in data science, machine learning, and artificial intelligence. It applies to a wide range of methods, from the logit model for classification to information-theoretic methods in deep learning. This article aims to provide an intuitive introduction to the principle.

Suppose you have three data points x = (1, 2, 3), and you believe that they are generated from a normal distribution with the unknown mean (μ) and standard deviation equal to 1, i.e., N(μ,1).

Given these data points, what is the most likely value of μ?

This is the question that the method of maximum likelihood estimation is designed to answer.

Image created by the author

Suppose the researcher considers three possible values, μ = (0, 2, 6), as likely candidates. Which of these values is most likely given the observed data?

The figure above plots the three normal probability density functions for N(0,1), N(2,1), and N(6,1), in blue, black, and green respectively. That is, they are f(X|μ=0), f(X|μ=2), and f(X|μ=6), where

f(x|μ) = (1/√(2π)) exp[−(x−μ)²/2].
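As a quick check, this density can be evaluated in R with dnorm(); for example, the density of N(2,1) at x = 1:

(1/sqrt(2*pi)) * exp(-(1 - 2)^2/2) # the formula above: 0.2419707
dnorm(1, mean=2, sd=1)             # same value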

The red square dots at the bottom indicate x=(1, 2, 3), which are the observed data points.

It is obvious from the above plots that the data points x are most likely to have been generated from N(2,1). They are quite unlikely to have come from N(0,1), and even more unlikely to have come from N(6,1). Hence, we can say that the value μ = 2 is associated with the highest likelihood, or compatibility, with x = (1, 2, 3).

If we consider all other possible values of μ and are convinced that 2 is the value most likely to have generated x, then 2 is the maximum likelihood estimate of μ.

Let us define some mathematical details:

f(X1,X2,X3|μ): joint probability density function of X, given μ. It shows the probability density of X, given the value of μ.

L(μ|x1,x2,x3): likelihood function of μ, given x. It shows the likelihood of μ, given the observed data x=(x1, x2, x3).

The difference is that the density function is indexed by the random variable X, given a value of a parameter such as μ, while the likelihood function is indexed by the parameter, given the observed data x.

The two functions are related as

L(μ|x1,x2,x3) = k f(x1,x2,x3|μ),

where k > 0 is any constant. Let us assume that k = 1 for simplicity. Then the two functions take the same values, differing only in their arguments and conditioning. If we assume for simplicity that the X’s are independent, then, since a joint density is the product of the individual densities under independence, we can write

L(μ|x1,x2,x3) = f(x1|μ) × f(x2|μ) × f(x3|μ).
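For example, the likelihood of μ = 2 given x = (1, 2, 3) is the product of the three individual density values, which in R can be computed as

dnorm(1, mean=2, sd=1) * dnorm(2, mean=2, sd=1) * dnorm(3, mean=2, sd=1) # 0.0233580
prod(dnorm(c(1, 2, 3), mean=2, sd=1))                                    # same value, vectorized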

μ     f(1|μ)       f(2|μ)       f(3|μ)       L(μ|x1,x2,x3)
0     0.2420       0.0540       0.0044       0.0000579
2     0.2420       0.3989       0.2420       0.0233580
6     0.0000015    0.0001338    0.0044318    8.82e-13

The above table shows the values of the likelihood function L(μ|x1,x2,x3) when x = (1, 2, 3): the value in the last column is the product of those in the first three. The highest likelihood is achieved at μ = 2, as the snippet below confirms.
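The last column of the table can be reproduced in a single line of R:

# L(mu | 1, 2, 3) for each candidate value of mu
sapply(c(0, 2, 6), function(m) prod(dnorm(c(1, 2, 3), mean=m, sd=1)))
# 5.79e-05 2.34e-02 8.82e-13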

Now we consider all possible values of μ, and plot the likelihood and log-likelihood functions as functions of μ. The log-likelihood function is defined as

l(μ|x1,x2,x3) = log[L(μ|x1,x2,x3)],

where log() is the natural logarithmic function. The log-likelihood is a monotonic transformation of the likelihood function, so both are maximized at the same value of μ. It is widely used because it is analytically more tractable: it turns the product of densities into a sum.
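Numerically, the log of the product of densities equals the sum of their logs:

log(prod(dnorm(c(1, 2, 3), mean=2, sd=1))) # -3.756816
sum(log(dnorm(c(1, 2, 3), mean=2, sd=1)))  # same value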

Image created by the author

The functions are plotted as above. It is clear from the plots that both the likelihood and the log-likelihood are maximized at μ = 2, which is therefore the maximum likelihood estimate for x = (1, 2, 3).

Analytically, it can be shown that the sample mean is the maximum likelihood estimator for a sample generated independently from N(μ,1), and the sample mean of x = (1, 2, 3) is indeed 2.
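A brief sketch of that derivation, for an i.i.d. sample x1, …, xn from N(μ,1), in LaTeX notation:

\ell(\mu \mid x_1, \dots, x_n) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^2

\frac{d\ell}{d\mu} = \sum_{i=1}^{n} (x_i - \mu) = 0 \quad \Longrightarrow \quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}

The second derivative is −n < 0, so this stationary point is indeed a maximum.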

The R code for the calculations and plots is as below:

x = c(1, 2, 3)       # observed data
X = seq(-5, 9, 0.01) # grid over which to plot the densities
par(mfrow=c(1, 1))
# plot the three candidate density functions
plot(X, dnorm(X, mean=2, sd=1), type="l",
     col="black", lwd=2, yaxt="n", ylab="density")
# in curve(), x is the plotting variable, not the data vector
curve(dnorm(x, mean=0, sd=1), col="blue", lwd=2, add=TRUE)
curve(dnorm(x, mean=6, sd=1), col="green", lwd=2, add=TRUE)
# points for data X
points(x,c(0,0,0),col="red",pch=15)
legend("topleft", legend=c("N(0,1)", "N(2,1)","N(6,1)"),
col=c("blue", "black","green"), lty=1,cex=1,lwd=2)
abline(v=c(0,2,6),col=c("blue","black","green"))

# Calculation of likelihood values at different mean values
prod(dnorm(x,mean=0,sd=1))
prod(dnorm(x,mean=2,sd=1))
prod(dnorm(x,mean=6,sd=1))

# Plotting likelihood and log-likelihood
m = seq(-4, 8, 0.1)    # grid of candidate values for mu
m1 = rep(0, length(m)) # likelihood values
m2 = rep(0, length(m)) # log-likelihood values
for (i in 1:length(m)) {
  # likelihood
  m1[i] = prod(dnorm(x, mean=m[i], sd=1))
  # log-likelihood
  m2[i] = sum(log(dnorm(x, mean=m[i], sd=1)))
}
# plotting
par(mfrow=c(1,2))
# Likelihood
plot(m,m1,type="l",ylab="L(mu|X)",xlab = "mu",lwd=2)
abline(v=2,col="red")
# Log-of-Likelihood
plot(m,m2,type="l",ylab="log of L(mu|X)",xlab = "mu",lwd=2)
abline(v=2,col="red")
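As a final check, the maximum can also be located numerically. A minimal sketch using base R's optimize():

# numerical maximization of the log-likelihood over (-4, 8)
ll = function(mu) sum(log(dnorm(x, mean=mu, sd=1)))
optimize(ll, interval=c(-4, 8), maximum=TRUE)$maximum # approximately 2, the sample mean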


