Let Data Science Make Choices for You | by Mattia Cinelli | May, 2022

An introduction to multi-armed bandits

Photo by Burst on Unsplash

Imagine that you have to make multiple choices from a set of mutually exclusive options with uncertain outcomes. How do you maximize your gains?

Iterative processes, choices made over and over again, are not abstract: they are something we do daily. What do we eat? What route do we take to go home, given the time of day and traffic? How do we minimise the time spent on the road? Which restaurant in my city should I go to?

These issues are addressed by a class of algorithms called Multi-Armed Bandits (MAB), a subfield of Reinforcement Learning (RL) algorithms, which aim to make better choices over time by performing actions without supervision.

Let’s review the most common and simplest algorithms together, starting from concrete examples and working our way to a mathematical formalization. Before my recent interest in RL (in the last two years), I had little knowledge of it. During my PhD, the books I studied focused on supervised and unsupervised learning, with reinforcement learning either briefly mentioned or discarded altogether. Only in recent years, thanks in large part to the innovations of DeepMind, has RL become a more relevant and widely studied field, and it is of great interest to me personally. MAB algorithms are the first step into this fascinating field, and I think they will soon be of great interest to you too, both personally and professionally.

In Reinforcement Learning we have an agent that has to perform actions within an environment. By taking actions, the agent receives rewards or penalties which, over time, ensure that the maximum reward becomes associated with the optimal behaviour for that agent within that particular environment. There is no training data: the agent learns by trial and error in an iterative and interactive process.

One of the simplest problems in the field (and from which the example of the multi-armed bandit problem is named) is a slot machine. A price is paid to start and the player receives a reward by pulling one lever from a set of n options. The aim of the game is to choose the right levers and maximise the total reward, which can be achieved by using a MAB algorithm.

Let’s use the same slot machine concept in a simpler example to explain what kind of problem MAB algorithms try to solve.

Let’s assume we have four buttons to choose from, labelled A, B, C and D. Each button releases a different reward when pressed.

Let’s also assume that there is a limit on how many times we can press the four buttons, say 100 times, and that each button gives the same reward every time: A always gives a reward of one, B two, C three, and D four.

What is the strategy that maximizes our total reward?

The two simplest strategies we could choose are:

1. Choose one button randomly and stick with that choice. That means we only have a 25% chance of choosing D and maximising our reward, a chance that would be even lower with more buttons.

2. Press buttons randomly each time. Here the expected result is the average reward of the buttons times 100 (2.5 × 100 = 250 in our example). On average the result will be the same every time we repeat the experiment, but it will never be the best.

These two options sit at opposite ends of the spectrum in what is called the exploitation-exploration dilemma. With the first, we have made only one decision and stuck with it (only exploited). With the other, we have only explored and never used the insight gained from our choices.

It is clear that we have to strike a balance between exploring what the best choice is and then exploiting the best one.

Here are four algorithms that try to achieve the exploitation-exploration balance:

  1. ε-greedy algorithm
  2. Optimistic Initial Values
  3. Gradient bandit algorithm
  4. Upper Confidence Bound (UCB) action selection

Continuing from the example above, we have 4 buttons with fixed rewards.

In this case, the strategy is simple: try each button once and keep selecting the one with the highest reward until the end.

Four single acts of exploration and 96 of exploitation, for a total return of 1 + 2 + 3 + (4 × 97) = 394, against the highest possible score of 400. The difference between the highest theoretical score and the actual return is called regret.
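As a minimal sketch of this strategy in Python (the rewards and the 100-press budget come from the example above; the variable names are my own):

```python
# Fixed-reward buttons from the example: A=1, B=2, C=3, D=4.
rewards = {"A": 1, "B": 2, "C": 3, "D": 4}
budget = 100  # total number of presses allowed

total = 0
observed = {}
# Exploration: press each button once and record what it pays.
for button, reward in rewards.items():
    observed[button] = reward
    total += reward

# Exploitation: press the best-looking button for the remaining presses.
best = max(observed, key=observed.get)
total += observed[best] * (budget - len(rewards))

print(total)               # 394
print(4 * budget - total)  # regret: 400 - 394 = 6
```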

Let’s say that the reward is not fixed, but each button is associated with a reward that is normally distributed. How would you solve that while maximizing the reward?

The ε-greedy algorithm adopts this solution: press the button with the highest estimated reward with probability ε and select a random button with probability 1−ε, while keeping track of the results and computing the average reward for each choice.

1. Set the average value for each button to zero.
2. Press each button once and record its value.
3. Continue until the end of the episode:
a. Select the button with the highest average value (being greedy) with probability ε, otherwise select any random button (note that the button with the highest value could still be picked in this step).
b. Update the mean associated with the pressed button.
c. Update the total return.
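A minimal sketch of this procedure in Python (a toy implementation of my own, not taken from the article’s repository; note that this article uses ε as the probability of exploiting, whereas textbooks usually use it for exploring):

```python
import random

def eps_greedy(true_means, epsilon=0.9, steps=100, sd=1.0):
    """Greedy with probability epsilon, random button otherwise."""
    n = len(true_means)
    estimates = [0.0] * n   # step 1: averages start at zero
    counts = [0] * n
    total_return = 0.0

    def press(i):
        # Each button pays a normally distributed reward around its true mean.
        return random.gauss(true_means[i], sd)

    # Step 2: press each button once and record its value.
    for i in range(n):
        r = press(i)
        counts[i], estimates[i] = 1, r
        total_return += r

    # Step 3: continue until the end of the episode.
    for _ in range(steps - n):
        if random.random() < epsilon:
            i = max(range(n), key=lambda a: estimates[a])  # be greedy
        else:
            i = random.randrange(n)                        # explore
        r = press(i)
        counts[i] += 1
        estimates[i] += (r - estimates[i]) / counts[i]     # incremental mean
        total_return += r
    return total_return

print(eps_greedy([1.0, 2.0, 3.0, 4.0]))
```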

With this method, the system will quickly learn which distribution offers the highest average reward, and it will choose that button with probability ε + (1 − ε)/n.

The average reward connected with each action is called the action-value.

It is also possible to use a flexible ε that increases over time, so that eventually only the best option is chosen and the exploration process stops.

For example, for ε = 90%, the best button in our example is chosen 90 + (100 − 90)/4 = 92.5% of the time; after a while, ε could be increased until this reaches 100%.
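As a sketch of such a schedule (the linear ramp is my own choice for illustration, not something prescribed by the article):

```python
def epsilon_at(step, total_steps, start=0.9, end=1.0):
    """Linearly increase the exploitation probability from `start` to `end`
    over the episode, so that late steps become purely greedy."""
    fraction = min(step / total_steps, 1.0)
    return start + (end - start) * fraction

# With start=0.9 and four buttons, the best button is initially chosen with
# probability 0.9 + (1 - 0.9) / 4 = 0.925, rising towards 1.0 by the end.
```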

Naturally, this strategy makes sense only if the distributions do not change over time, because if they do, the best choice could change.

In the ε-greedy algorithm, the first value recorded from each button is taken as its first average value, and the initial estimates are typically set to zero.

This initial-zero-values strategy carries a risk of biasing the entire process. We saw that the ε-greedy algorithm starts choosing the highest-average value immediately after the first cycle of exploration. But what would happen if one distribution has a large standard deviation and its first values are very far from its true mean?

In that case, it would take a while for the system to adjust, and even longer if the true mean of this hypothetical large-SD distribution is close to that of the distribution with the highest average.

The system might not find the best option before the maximum number of button presses is reached. This limit on the process is called an episode.

To solve this problem, we can set the initial values to something much bigger than any reward we expect. Such a value is called an optimistic initial value, because it is set far higher than the bandit’s true action-values.

In this plot, we can see an example of the optimistic algorithm vs the ε-greedy. The latter starts exploiting the best action it has found, while the optimistic one spends more time in an exploration phase, gaining better insight into the optimal action. [Image by [1]]

Whichever action is selected first, its reward is less than the optimistic value, so its estimated average drops below the values of the other actions. As a result, all actions are tried several times before the value estimates converge: the method spends longer exploring, because the action with the highest estimated value changes quite often, and it takes time before one action is chosen consistently.
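A minimal sketch of the change (the optimistic initial value of 5 matches the one used in the experiments later; the constant step size is my own addition, a common choice, e.g. in Sutton and Barto [1], that lets the optimism fade gradually rather than being overwritten by the first observation):

```python
import random

def optimistic_greedy(true_means, initial_value=5.0, steps=100,
                      step_size=0.1, sd=1.0):
    """Purely greedy selection, but with every estimate initialised well
    above any realistic reward, which forces early exploration."""
    n = len(true_means)
    estimates = [initial_value] * n   # optimistic start
    total_return = 0.0
    for _ in range(steps):
        i = max(range(n), key=lambda a: estimates[a])   # always greedy
        r = random.gauss(true_means[i], sd)
        estimates[i] += step_size * (r - estimates[i])  # optimism decays slowly
        total_return += r
    return total_return
```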

As we mentioned, an ε-greedy algorithm takes the best action at the present time step, considering only the highest average at that point (the action-value estimates). However, it does not consider how uncertain those estimates are.

In the trade-off between exploitation and exploration, the ε-greedy algorithm gives a way of choosing non-greedy actions, but when it explores it chooses at random among all actions, with no preference for those that are nearly greedy or particularly uncertain.

The Upper Confidence Bound (UCB) action selection instead evaluates the actions that do not have the highest value (the non-greedy ones) according to their potential for being the true optimal action (optimism in the face of uncertainty). This is carried out by the following formula, which considers both the action-value estimates and the uncertainties of those estimates:

Upper-Confidence-Bound action selection formula: At = argmax_a [ Qt(a) + c · sqrt(ln t / Nt(a)) ], where c is a hyper-parameter controlling the degree of exploration and Nt(a) is the number of times action a has been chosen before time t. Qt(a) represents the exploitation part of the equation: the action chosen will be the one with the highest estimated reward. The second term represents the exploration side and provides a measure of uncertainty for the action-value. [Image by author]

Every time an action is taken, its action-value estimate becomes more accurate (Qt(a) gets closer to the true value). As Nt(a) increases, the probability of that action being selected for exploration purposes decreases, although it can still be selected for exploitation. Meanwhile, the exploration term of the actions not selected grows slowly (ln t increases while their Nt(a) stays the same), until one of them triggers another explorative choice.
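A minimal sketch of the selection rule (my own toy code, not the article’s implementation):

```python
import math

def ucb_select(estimates, counts, t, c=2.0):
    """Pick the action maximising Qt(a) + c * sqrt(ln t / Nt(a)).
    `estimates` holds the current action-value averages, `counts` how many
    times each action has been taken, and `t` is the 1-based time step."""
    for a, n in enumerate(counts):
        if n == 0:
            return a   # an untried action is treated as maximally uncertain
    return max(range(len(estimates)),
               key=lambda a: estimates[a]
               + c * math.sqrt(math.log(t) / counts[a]))
```

Inside an episode loop, `estimates` and `counts` would be updated with the observed rewards exactly as in the ε-greedy sketch above.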

UCB can judge when to explore and when to exploit, and it is quite fast at finding the optimal action. For this reason, UCB is one of the most effective algorithms, as we will see in the experiment section.

So far, we have seen methods that choose the action to take based on estimated action values. Other approaches are possible; one of them is to use the soft-max distribution (i.e., the Gibbs or Boltzmann distribution) as follows:

Gradient bandit formula (soft-max); more info here. [Image by author]

The soft-max function calculates a probability distribution over all actions, which is then used to select the next action.

You might have already encountered the concept of soft-max distribution in the field of neural networks as it is often used when the output layer has one node per class.

Explaining the maths behind the gradient bandit algorithm (GBA) would probably take another article, longer than this one. I have still chosen to include it here because I wanted to show that using the average reward as the action-value is not the only possible option.
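Without going into the derivation, here is a minimal sketch of the idea, following the standard preference-based formulation in Sutton and Barto [1] (the step size α = 0.1 is my own choice):

```python
import math
import random

def gradient_bandit(true_means, steps=100, alpha=0.1, sd=1.0):
    """Keep a numerical preference H[a] per action, turn the preferences into
    probabilities with soft-max, sample an action, and push preferences up for
    actions that beat the running average reward (and down otherwise)."""
    n = len(true_means)
    H = [0.0] * n
    baseline, t = 0.0, 0
    for _ in range(steps):
        exp_h = [math.exp(h) for h in H]
        total = sum(exp_h)
        probs = [e / total for e in exp_h]               # soft-max distribution
        a = random.choices(range(n), weights=probs)[0]   # sample an action
        r = random.gauss(true_means[a], sd)
        t += 1
        baseline += (r - baseline) / t                   # running average reward
        for b in range(n):                               # preference update
            if b == a:
                H[b] += alpha * (r - baseline) * (1 - probs[b])
            else:
                H[b] -= alpha * (r - baseline) * probs[b]
    return H
```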

Here we have seen the most popular algorithms for multi-armed bandit problems, which take different approaches to the exploration vs. exploitation dilemma:

  1. Only exploring, which chooses a random action every time step
  2. Only exploiting, which selects a random action the first time and repeats it
  3. The ε-greedy methods, which explore randomly for a small fraction of the time and otherwise exploit the best-known action
  4. Optimistic, which forces the system to have a longer initial exploration phase.
  5. UCB chooses deterministically but achieves exploration by favouring actions that have received fewer samples.
  6. GBA is an entirely different approach based on probability distribution.

As you probably figured out, there is no method that works best for every possible scenario. UCB is probably the most widely used, but this is not guaranteed.

However, I thought it would be best to show how these algorithms behave on the first example I gave. I have run 250 experiments of 75 time steps each, and this is the result.

I have chosen the value of 0.75 for epsilon as a good middle point for each method; c for UCB is 2, the optimistic initial value is 5, and the standard deviation is 1. [Image by author]

GBA is able to find the optimal action only 25% of the time — this is a bit surprising, so maybe it would be a poor algorithm to use in this particular scenario.

The best-performing method is UCB, which finds the optimal action in a short amount of time and follows it to the end of the episode. The optimistic algorithm takes advantage of its early exploration phase and also finds the optimal action, whereas the greedy algorithm needs more time steps.

Only-exploration and only-exploitation select the optimal action 25% of the time, as predicted.

The average reward per time step for 250 experiments. [Image by author]

Similarly, if we look at the average reward per time step, UCB stabilises at the highest reward, the optimistic algorithm has a learning curve at the beginning, and greedy lags behind.

MAB and these four algorithms are just the beginning of a long journey in the study of Reinforcement Learning.

Because this is a complex sub-field of RL, I have created a Python implementation of the most important algorithms, which you can find here: https://github.com/MattiaCinelli/AlgoRL

In this article, I have not mentioned Thompson Sampling (for Bernoulli or Gaussian distributions) or Bayesian approaches, which can be found in the code; I encourage you to try them out.

You are all welcome to contribute to and improve the code, and if you have questions, don’t hesitate to reach out.

  1. Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. 2nd Edition.
  2. Morales, M. Grokking Deep Reinforcement Learning.

