Full Explanation of MLE, MAP and Bayesian Inference | by Oliver S | Mar, 2023

By Jessie Hobb On Mar 7, 2023

Introducing maximum likelihood estimation, maximum a posteriori estimation and Bayesian Inference

In this post we will introduce the concepts MLE (maximum likelihood estimation), MAP (maximum a posteriori estimation) and Bayesian inference — which are fundamental to statistics, data science and machine learning, to name just a few fields. We will explain each method using the same example of an unfair coin toss, derive results analytically and numerically (for Bayesian inference) and show differences.

We will learn that MLE maximises the likelihood, i.e. it chooses the parameters under which the observed data is most probable. MAP adds a prior, introducing prior knowledge about the parameters, and thus bridges the gap from a purely Frequentist concept to a Bayesian one. Bayesian inference gives us the most information but is also the hardest to carry out: it involves modelling the full posterior distribution of the parameters given the data, whereas the previous two methods yield only point estimates.

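Let’s set the stage by formalising these descriptions: for any kind of learning problem, we want to find the model, i.e. the parameters θ, which describe the observed data x as well as possible. We can fully describe and solve this by finding a distribution over the parameters given the observed data, which Bayes’ theorem expresses as:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

Usually, these terms are named as:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$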

The conditional distribution of the parameters given the data, p(θ | x), is known as the posterior distribution; this is the term we are interested in. The likelihood p(x | θ) is the conditional distribution of the data given a certain parameter value. As the name suggests, it describes how likely the observed data is under the chosen parameter value. The prior distribution p(θ) describes our belief about the parameters: what we expect them to look like before observing any data. Finally, the evidence p(x) is a normalisation constant that makes the posterior a proper probability distribution. It is the probability of the data averaged over all possible parameter values, and it is usually hard to compute or even intractable, which is one of the main reasons why Bayesian inference is often hard.

With these insights and formulas / notation, we will introduce MLE / MAP / Bayesian inference in the following. But first, let us introduce the problem we will use for demonstration throughout.

To showcase these concepts, we will use the example of throwing an unfair coin: assume we are given a coin which lands heads with probability θ, and tails with probability 1 - θ. θ is unknown to us, and we would like to estimate it after experimenting with the coin. That is, we assume we have or collect a dataset of coin tosses, and then want to find an estimate for θ using MLE, MAP and Bayesian inference.

Before moving on, let us describe a quantity we’ll need for all following sections: the likelihood. Throwing a coin follows a Bernoulli distribution. Assume we throw the coin N times, then denote the number of heads with Nₕ and the number of tails with Nₜ.

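One core assumption for this (and everything that follows) is independence: we assume that all tosses of the coin are independent and identically distributed (i.i.d.), which allows us to write the likelihood as a product of individual likelihoods. In particular, for the N individual tosses x_i we obtain:

$$p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

Plugging in the Bernoulli distribution yields:

$$p(x \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{N_h} (1 - \theta)^{N_t}$$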

In this, we model single events (coin tosses) with a binary variable of value 1 for heads, and 0 for tails.

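Since the logarithm is monotonic, maximising the log-likelihood yields the same θ as maximising the likelihood, and sums are often easier to work with than products. We therefore take the logarithm and obtain:

$$\log p(x \mid \theta) = N_h \log \theta + N_t \log (1 - \theta)$$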
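As a minimal sketch of the log-likelihood for i.i.d. coin tosses, the snippet below evaluates it once toss by toss and once from the head and tail counts alone. The function names and the toy data are illustrative assumptions, not part of the original post.

```python
import math


def log_likelihood(theta, n_heads, n_tails):
    """Bernoulli log-likelihood from counts: N_h * log(theta) + N_t * log(1 - theta)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)


def log_likelihood_from_tosses(theta, tosses):
    """Same quantity, summed toss by toss (1 = heads, 0 = tails)."""
    return sum(
        x * math.log(theta) + (1 - x) * math.log(1.0 - theta) for x in tosses
    )


# Hypothetical toy data: 3 heads and 2 tails.
toy_tosses = [1, 0, 1, 1, 0]
per_toss = log_likelihood_from_tosses(0.6, toy_tosses)
from_counts = log_likelihood(0.6, n_heads=3, n_tails=2)
print(per_toss, from_counts)
```

Both calls should print the same value (up to floating-point rounding), which reflects that the data enters the likelihood only through the two counts.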

Dataset

Before continuing, let’s define a sample dataset which we observed by throwing this unfair coin, and which we’ll use throughout the next sections to demonstrate the results of the different methods.

We will assume we have flipped an unfair coin with true θ = 0.3 a total of 100 times, and observed heads 36 times and tails 64 times.

Now, let’s come to our first concept of interest: MLE. Here, we want to find the set of parameters which maximises the likelihood of the observed data. For this, we form the likelihood function, take the derivative w.r.t. the parameters, set it to 0 and solve. Note that, even if you might not have known the name before, it is very likely you have applied this concept before, as it is at the core of most ML algorithms. Think of a neural network solving a regression or classification problem: minimising an L2 loss or a cross-entropy loss can be interpreted as maximising a likelihood, and gradient descent then searches for the optimum.

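Thus, let’s take the log-likelihood from above and find the θ maximising it, i.e. form the derivative and set it to 0:

$$\frac{\partial}{\partial \theta} \log p(x \mid \theta) = \frac{N_h}{\theta} - \frac{N_t}{1 - \theta} = 0$$

Solving for θ and some minor rearranging gives us the MLE estimate for θ:

$$\theta_{MLE} = \frac{N_h}{N_h + N_t} = \frac{N_h}{N}$$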

This should also make sense intuitively: the parameter θ, the probability of the coin landing heads, should — without any other prior information or constraints — equal the ratio of observed head results and total number of throws.

Let’s apply this to our dataset introduced above. We obtain:

$$\theta_{MLE} = \frac{36}{100} = 0.36$$
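The closed-form result can be sanity-checked numerically. The sketch below, which uses hypothetical helper names and only the standard library, compares the ratio of heads to total tosses with a brute-force search over a grid of candidate values for θ.

```python
import math

N_HEADS, N_TAILS = 36, 64  # dataset from the article


def log_likelihood(theta):
    """Log-likelihood of the observed counts for a given theta."""
    return N_HEADS * math.log(theta) + N_TAILS * math.log(1.0 - theta)


# Closed-form maximum likelihood estimate
theta_mle = N_HEADS / (N_HEADS + N_TAILS)

# Brute-force check on the grid 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_mle, theta_grid)
```

The grid search is only there as a check; for this model the closed-form expression is all that is needed.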

MLE represents maximising the likelihood: finding the set of parameter values which explain the data best, without any prior information or further constraints. Now, we turn to MAP, which entails introducing a prior over the parameters and is actually equivalent to finding the maximum of the posterior distribution.

MLE and MAP are good examples for the discussions between Frequentist and Bayesian concepts. While Frequentists view probabilities simply as the results of observing random, repeatable events — the Bayesian view models everything, including parameters, as random variables — which can be updated given new evidence. While Bayesian concepts (in my opinion) usually give more powerful solutions, one core Frequentist criticism is their need of priors — which have to come from somewhere, and can be wrong.

Further note, that adding regularisation to ML models actually can be shown to equal MAP (in addition to above statement, that models without regularisation represent MLE concepts). But this interesting realisation would lead too far here, and we will dedicate a future post to it.

Let’s revisit the introductory formula:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

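Since p(x) is a constant that does not depend on θ, it disappears when we form the derivative w.r.t. θ and set it to 0. Thus, MAP indeed gives us the point estimate for θ that maximises the posterior distribution. All we need is a prior. We choose a beta distribution, whose pdf for a given θ, with parameters α and β and the Beta function B(α, β) as normalisation constant, is:

$$p(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$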

Why we chose a beta distribution we will discuss in the next section (hint: conjugate priors).

For simplicity we again apply the logarithm, resulting in the following problem:

$$\theta_{MAP} = \arg\max_{\theta} \left[ \log p(x \mid \theta) + \log p(\theta) \right]$$

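Thus, when forming the derivative and setting it to 0, we can consider each summand separately. The derivative of the left one we already calculated above:

$$\frac{\partial}{\partial \theta} \log p(x \mid \theta) = \frac{N_h}{\theta} - \frac{N_t}{1 - \theta}$$

Let us now look at the right one:

$$\log p(\theta) = (\alpha - 1) \log \theta + (\beta - 1) \log (1 - \theta) - \log B(\alpha, \beta)$$

Forming the derivative yields:

$$\frac{\partial}{\partial \theta} \log p(\theta) = \frac{\alpha - 1}{\theta} - \frac{\beta - 1}{1 - \theta}$$

We can now combine these two terms and set the sum to 0:

$$\frac{N_h + \alpha - 1}{\theta} - \frac{N_t + \beta - 1}{1 - \theta} = 0$$

Some rearranging returns the MAP estimate for θ:

$$\theta_{MAP} = \frac{N_h + \alpha - 1}{N_h + N_t + \alpha + \beta - 2} = \frac{N_h + \alpha - 1}{N + \alpha + \beta - 2}$$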

To apply this formula, we first have to come up with our prior. Let’s say we have vaguely heard (e.g. from the coin manufacturer) that the true heads probability should be around 0.3, but we are not really sure.

To model this belief, let’s pick a beta distribution with α = 4 and β = 10. We can plot this distribution via:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta

# Evaluate the Beta(4, 10) prior density on a grid over [0, 1]
x = np.linspace(0, 1, 100)
plt.plot(x, beta.pdf(x, 4, 10))
plt.show()
```

Resulting in the following plot:

Using this prior, we obtain:

$$\theta_{MAP} = \frac{36 + 4 - 1}{100 + 4 + 10 - 2} = \frac{39}{112} \approx 0.348$$
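To double-check the MAP value, the following sketch evaluates the closed-form expression and compares it with a brute-force maximisation of the unnormalised log-posterior on a grid. The helper names are assumptions made for illustration; the evidence and the Beta normalisation constant are dropped because they do not depend on θ.

```python
import math

N_HEADS, N_TAILS = 36, 64  # dataset from the article
ALPHA, BETA = 4, 10        # beta prior from the article


def log_unnormalised_posterior(theta):
    """Log-likelihood plus log-prior, up to additive constants in theta."""
    log_lik = N_HEADS * math.log(theta) + N_TAILS * math.log(1.0 - theta)
    log_prior = (ALPHA - 1) * math.log(theta) + (BETA - 1) * math.log(1.0 - theta)
    return log_lik + log_prior


# Closed-form MAP estimate
theta_map = (N_HEADS + ALPHA - 1) / (N_HEADS + N_TAILS + ALPHA + BETA - 2)

# Brute-force check on the grid 0.0001, 0.0002, ..., 0.9999
grid = [i / 10000 for i in range(1, 10000)]
theta_map_grid = max(grid, key=log_unnormalised_posterior)

print(round(theta_map, 3), theta_map_grid)
```

The grid maximiser can only match the closed form up to the grid resolution, so a small difference between the two printed values is expected.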

Now, let’s come to full Bayesian inference and solve for the full posterior distribution instead of a point estimate for the parameters: i.e., now we will obtain a probability distribution over the parameter(s) conditioned on the observed data.

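Let’s first examine the numerator:

$$p(x \mid \theta)\, p(\theta)$$

Plugging in the previously introduced distributions and results we obtain:

$$p(x \mid \theta)\, p(\theta) = \theta^{N_h} (1 - \theta)^{N_t} \cdot \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} = \frac{\theta^{N_h + \alpha - 1} (1 - \theta)^{N_t + \beta - 1}}{B(\alpha, \beta)}$$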

This gives rise to the notion of conjugate priors teased above: a prior is called a conjugate prior for a likelihood if the product of likelihood and prior has the same functional form as the prior. As a consequence, the posterior is in the same family as the prior, which we’ll see in a bit.

Let’s now come to the evidence, the denominator. This usually is the tricky part and is often intractable, as it entails marginalising over all possible parameter values:

$$p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$$

This integral is usually hard to compute or even intractable, and is one of the main reasons why exact Bayesian inference is hard. However, as we will see later on, we usually don’t even try to solve it analytically, but resort to various approximation techniques, which also yield satisfactory results.

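In our case, however, there exists an analytical, closed-form solution. Looking at the integral, we spot that the integrand is the same expression we calculated above:

$$p(x) = \frac{1}{B(\alpha, \beta)} \int_0^1 \theta^{N_h + \alpha - 1} (1 - \theta)^{N_t + \beta - 1}\, d\theta$$

Following the definition of the Beta function, which we cheekily ignored above, we observe that the remaining integral is actually the Beta function evaluated at Nₕ + α and Nₜ + β:

$$B(a, b) = \int_0^1 \theta^{a - 1} (1 - \theta)^{b - 1}\, d\theta \quad \Rightarrow \quad p(x) = \frac{B(N_h + \alpha, N_t + \beta)}{B(\alpha, \beta)}$$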
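The identification of the integral with the Beta function can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the Beta function is computed from log-gamma values, and the integral of θ^(Nₕ+α-1) (1-θ)^(Nₜ+β-1) over [0, 1] is approximated with a plain midpoint rule. The prior's own normalisation constant is left out, as in the text.

```python
import math

N_HEADS, N_TAILS = 36, 64  # dataset from the article
ALPHA, BETA = 4, 10        # beta prior from the article


def beta_function(a, b):
    """B(a, b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def midpoint_integral(f, lower, upper, num_steps=100_000):
    """Plain midpoint rule; sufficient for this smooth integrand."""
    width = (upper - lower) / num_steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(num_steps))


def integrand(theta):
    """Likelihood times unnormalised beta prior."""
    return theta ** (N_HEADS + ALPHA - 1) * (1.0 - theta) ** (N_TAILS + BETA - 1)


numeric = midpoint_integral(integrand, 0.0, 1.0)
closed_form = beta_function(N_HEADS + ALPHA, N_TAILS + BETA)
relative_error = abs(numeric - closed_form) / closed_form
print(numeric, closed_form, relative_error)
```

Both numbers are extremely small in absolute terms, which is why the comparison uses the relative error.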

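Now we can put numerator and denominator together to obtain a closed-form version for the posterior, which is again a beta distribution:

$$p(\theta \mid x) = \frac{\theta^{N_h + \alpha - 1} (1 - \theta)^{N_t + \beta - 1}}{B(N_h + \alpha, N_t + \beta)} = \text{Beta}(\theta;\, N_h + \alpha,\, N_t + \beta)$$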
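The conjugate update can be written as a tiny function: with a beta prior and Bernoulli tosses, the posterior parameters are obtained by adding the head and tail counts to α and β. The sketch below (function name hypothetical) also illustrates that processing the data in chunks, using each posterior as the next prior, leads to the same result as one batch update.

```python
def update_beta(alpha, beta, tosses):
    """Return the posterior (alpha, beta) after observing tosses (1 = heads, 0 = tails)."""
    n_heads = sum(tosses)
    n_tails = len(tosses) - n_heads
    return alpha + n_heads, beta + n_tails


# Dataset from the article: 36 heads followed by 64 tails
tosses = [1] * 36 + [0] * 64

# All data at once, starting from the Beta(4, 10) prior
posterior_batch = update_beta(4, 10, tosses)

# Same data in two chunks: the first posterior becomes the prior for the second chunk
intermediate = update_beta(4, 10, tosses[:50])
posterior_sequential = update_beta(*intermediate, tosses[50:])

print(posterior_batch, posterior_sequential)
```

The order of the tosses does not matter here, since only the counts enter the update.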

Let’s see what this distribution looks like, keeping our beta prior from before with α = 4 and β = 10, which gives a Beta(40, 74) posterior for our dataset:

Making use of existing knowledge about the beta distribution, the mean of a random variable X following a beta distribution with parameters α and β is given by:

$$E[X] = \frac{\alpha}{\alpha + \beta}$$

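In our case, with posterior parameters 40 and 74, this evaluates to 40 / 114 ≈ 0.351, which is close to, but not identical with, the MAP estimate of 0.348: MAP corresponds to the mode of the posterior, not its mean.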

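For the variance of a beta distribution, evaluated at the posterior parameters 40 and 74, we obtain:

$$Var[X] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = \frac{40 \cdot 74}{114^2 \cdot 115} \approx 0.002$$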
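As a small sketch, the standard closed-form expressions for the mean, mode and variance of a beta distribution can be evaluated for the Beta(36 + 4, 64 + 10) posterior. The function name is an assumption made for illustration, and the mode formula only holds when both parameters are larger than 1.

```python
import math


def beta_summary(a, b):
    """Mean, mode and variance of a Beta(a, b) distribution (mode requires a, b > 1)."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, variance


# Posterior for 36 heads and 64 tails under the Beta(4, 10) prior
post_mean, post_mode, post_var = beta_summary(36 + 4, 64 + 10)
post_std = math.sqrt(post_var)

print(round(post_mean, 3), round(post_mode, 3), round(post_std, 3))
```

Comparing the first two printed values shows how close the mean and the mode of this posterior are, while the standard deviation gives a sense of the remaining uncertainty about θ.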

In this section we’ll discuss and compare the results obtained in the previous sections.

MLE estimates θ, the probability of the coin landing heads, to be 0.36, which is exactly the relative frequency of observed heads (36 / 100). This makes sense: without any other information or prior knowledge, the parameter best explaining the data should be the one directly reflecting it, which is 0.36 in our case.

Our MAP estimate is 0.348, which is closer to the true value of 0.3 because it is pulled towards the prior. Our prior puts most of its mass around 0.25 to 0.3 (its mode is 0.25, its mean roughly 0.29) with a relatively small variance, and this is reflected in the final result.

To see this effect, consider a prior with a higher variance, e.g. given by α = 2 and β = 3:

In this case, our MAP estimate becomes 0.359 — which is closer to the MLE value, as it’s less affected by the prior.
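The effect of the prior's spread can be made concrete with a short sketch that computes the variance of both priors and the resulting MAP estimates. The helper names are hypothetical; the formulas are the standard beta variance and the MAP expression derived earlier.

```python
def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def map_estimate(n_heads, n_tails, a, b):
    """MAP estimate for theta under a Beta(a, b) prior (requires a, b > 1)."""
    return (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)


N_HEADS, N_TAILS = 36, 64  # dataset from the article

narrow_var = beta_variance(4, 10)
wide_var = beta_variance(2, 3)
narrow_map = map_estimate(N_HEADS, N_TAILS, 4, 10)
wide_map = map_estimate(N_HEADS, N_TAILS, 2, 3)

print("Beta(4, 10): variance", round(narrow_var, 4), "MAP", round(narrow_map, 3))
print("Beta(2, 3):  variance", round(wide_var, 4), "MAP", round(wide_map, 3))
```

The prior with the larger variance carries less information, so its MAP estimate stays closer to the MLE of 0.36.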

Bayesian inference returns a full posterior distribution. Its mode is 0.348, i.e. the same as the MAP estimate. This is expected, as MAP is simply the point estimate given by the maximum of the posterior distribution. However, having the full posterior distribution gives us much more insight into the problem, which we’ll cover two sections down.

For this example, we managed to find a closed form for the posterior and thus solve the Bayesian inference problem analytically. However, as stated before, this is usually hard or even infeasible. Luckily, in practice we don’t need to solve this problem analytically, and often don’t even try, but instead resort to some sort of approximation.

There are several possibilities for this; here, we employ numerical approximation using MCMC methods. The linked post introduces them at length, so I would kindly refer there for more details. Here, we just briefly summarise the core concept: MCMC methods work by defining a Markov chain which is relatively simple to sample from, and whose stationary distribution is the target distribution. We then follow this Markov chain, generating N dependent samples. Once the chain has converged to its stationary distribution, these samples are distributed according to the target distribution.

Here we now want to apply this principle, in particular the Metropolis-Hastings algorithm, to approximate the posterior distribution we solved analytically above.

We can do this with the following code snippet:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

THETA_TRUE = 0.3  # True probability for landing heads
# Parameters defining the beta prior distribution
ALPHA_PRIOR = 4
BETA_PRIOR = 10
NUM_SAMPLES = 100000  # Number of MCMC steps

# Fake a dataset which equals the one assumed in the previous sections
D = np.asarray([1] * 36 + [0] * 64)

# Define prior distribution
prior = scipy.stats.beta(ALPHA_PRIOR, BETA_PRIOR)


def likelihood_ratio(theta_1, theta_2):
    # Ratio of the Bernoulli likelihoods of D under theta_1 and theta_2
    return (theta_1 / theta_2) ** np.sum(D == 1) * (
        (1 - theta_1) / (1 - theta_2)
    ) ** np.sum(D == 0)


def norm_ratio(theta_1, theta_2):
    # Ratio of the prior densities at theta_1 and theta_2
    return prior.pdf(theta_1) / prior.pdf(theta_2)


# Step 1: initialise the chain with a random starting value
x = np.random.uniform(0, 1)

# Proposal distribution
q = scipy.stats.norm(0, 0.1)

samples = []
for i in range(NUM_SAMPLES):
    # Step 2: propose a new value
    y = x + q.rvs()
    # Step 3: compute the acceptance probability
    ratio = likelihood_ratio(y, x) * norm_ratio(y, x)
    p = min(ratio * q.pdf(x - y) / q.pdf(y - x), 1)
    # Step 4: draw a uniform random number
    u = np.random.uniform(0, 1)
    # Step 5: accept the proposal if it lies in [0, 1], otherwise keep the current value
    x = y if u <= p and 0 <= y <= 1 else x
    samples.append(x)

# Plot the sampled posterior distribution
plt.hist(samples, density=True, bins=100)
# Plot the posterior distribution obtained by the analytical solution
x_values = np.linspace(0, 1, 100)
plt.plot(x_values, scipy.stats.beta.pdf(x_values, 36 + ALPHA_PRIOR, 100 - 36 + BETA_PRIOR))
plt.show()
```

The last two lines plot the sampled posterior distribution (histogram) vs the calculated posterior (line) — and we observe the expected overlap:
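As a complementary sketch, the same idea can be written compactly in log space, which avoids multiplying very small numbers. This variant is an illustration rather than the author's code: it uses only the standard library with a fixed seed, and the function name, step size, chain length and burn-in are assumptions. Because the normal proposal is symmetric, the proposal ratio equals 1 and is omitted.

```python
import math
import random

N_HEADS, N_TAILS = 36, 64        # dataset from the article
ALPHA_PRIOR, BETA_PRIOR = 4, 10  # beta prior from the article


def log_unnormalised_posterior(theta):
    """log(likelihood * prior) up to an additive constant; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (N_HEADS + ALPHA_PRIOR - 1) * math.log(theta) + (
        N_TAILS + BETA_PRIOR - 1
    ) * math.log(1.0 - theta)


def metropolis(num_steps, step_size=0.1, start=0.5, seed=0):
    """Random-walk Metropolis sampler with a symmetric normal proposal."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(num_steps):
        y = x + rng.gauss(0.0, step_size)
        log_ratio = log_unnormalised_posterior(y) - log_unnormalised_posterior(x)
        # Accept with probability min(1, exp(log_ratio))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
        samples.append(x)
    return samples


samples = metropolis(num_steps=20_000)
kept = samples[2_000:]  # discard an arbitrarily chosen burn-in period
sample_mean = sum(kept) / len(kept)
analytical_mean = (N_HEADS + ALPHA_PRIOR) / (
    N_HEADS + N_TAILS + ALPHA_PRIOR + BETA_PRIOR
)
print(round(sample_mean, 3), round(analytical_mean, 3))
```

The sample mean should land close to the analytical posterior mean; the exact value depends on the seed and the chain length.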

Having come this far, you might ask the question: all methods shown so far give me the same (or very similar) mode of the posterior distribution — leading me to always pick the same parameter in the end. Why bother then to do this complicated Bayesian inference?

In this final section we will answer this question. We’ll first explain what advantages this brings, and finish up with a practical example showcasing this.

To start: doing Bayesian inference gives us much more information. We not only get the mode of the posterior distribution, but the full distribution. We can thus inspect it, calculate other moments (such as the variance) and overall get a much better understanding of the problem. In particular, through this we get a sense of uncertainty, and can also decide to reject our hypothesis explaining the data. Let’s demonstrate this on an example.

Assume we change our coin analysis into a “game”: we are given two coins, and have to pick that one which is fair, i.e. lands 50:50.

The game show host presents you with two coins:

Coin 1 was thrown 8 times and landed heads in 4 of them.
Coin 2 was thrown 100 times and landed heads in 50 of them.

At first glance, both seem to land heads with around 50% probability. However, intuitively most people would surely pick coin 2, as here we have a much larger sample size. Clever as you are, you quickly apply the maximum a posteriori method in your head. You pick a beta prior with α = β = 2, giving you a nicely spread symmetric distribution over [0, 1] with mode 0.5.

Let’s do the math and calculate the MAP estimates for θ:

θ₁ = (4 + 2 - 1) / (8 + 2 + 2 - 2) = 5 / 10 = 0.5
θ₂ = (50 + 2 - 1) / (100 + 2 + 2 - 2) = 51 / 102 = 0.5

Thus, according to the MAP method, both coins yield exactly the same result!

Let’s plot the full posterior distribution for both, as obtained by Bayesian inference above:

Now, in line with our expectations, the posterior for coin 2 has a much lower variance, which justifies picking it!
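The difference between the two coins can also be quantified without a plot. The sketch below is an illustration with hypothetical helper names: it computes the posterior standard deviation for each coin and, via a simple midpoint rule on the beta density, the posterior probability that θ lies between 0.4 and 0.6. The interval is an arbitrary choice made for this example.

```python
import math


def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def beta_log_pdf(theta, a, b):
    """Log-density of Beta(a, b) at theta, for theta strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)


def interval_probability(a, b, lower, upper, num_steps=20_000):
    """Midpoint-rule approximation of P(lower < theta < upper) under Beta(a, b)."""
    width = (upper - lower) / num_steps
    return width * sum(
        math.exp(beta_log_pdf(lower + (i + 0.5) * width, a, b))
        for i in range(num_steps)
    )


PRIOR_ALPHA = PRIOR_BETA = 2  # symmetric prior from the game example

# Coin 1: 4 heads, 4 tails. Coin 2: 50 heads, 50 tails.
std_1 = beta_std(PRIOR_ALPHA + 4, PRIOR_BETA + 4)
std_2 = beta_std(PRIOR_ALPHA + 50, PRIOR_BETA + 50)
prob_1 = interval_probability(PRIOR_ALPHA + 4, PRIOR_BETA + 4, 0.4, 0.6)
prob_2 = interval_probability(PRIOR_ALPHA + 50, PRIOR_BETA + 50, 0.4, 0.6)

print("coin 1:", round(std_1, 3), round(prob_1, 3))
print("coin 2:", round(std_2, 3), round(prob_2, 3))
```

A smaller standard deviation and a larger probability of θ being near 0.5 both point to coin 2 as the safer choice, even though the MAP estimates are identical.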

In this article, we introduced MLE (maximum likelihood estimation), MAP (maximum a posteriori estimation) and Bayesian inference. We used the example of an unfair coin throughout for demonstration.

MLE finds parameters which optimize the likelihood. MAP introduces a new prior over the parameters, returning parameters which maximize the full posterior. Thus, both MLE and MAP return point estimates. In contrast, Bayesian inference models the full posterior distribution. This usually is a complex, often even intractable task — but also more powerful, as we get more insights into this distribution, such as the variance.

This brings us to the end of this article. Thanks for reading!

Notes:

All images, unless stated otherwise, were generated by the author.
Examples and calculations throughout this post were partially motivated by this great tutorial.