Solving Multi-Armed Bandit Problems | by Hennie de Harder | Nov, 2022


A powerful and easy way to apply reinforcement learning.

Reinforcement learning is a fast-growing field that is achieving impressive results in gaming, robotics and financial markets. A nice thing is that you don’t have to start with Deep Q-Learning agents! For some problems, it’s enough to implement a simple algorithm based on the principles of reinforcement learning. In this post, I will dive into multi-armed bandit problems and build a basic reinforcement learning program in Python.

Let’s start with an explanation of reinforcement learning. It works a bit like learning in real life: when a toddler cleans up his toys, his mother is happy and gives him candy. But if the toddler hits his sister, his mother gets angry and he gets punished. The toddler takes actions and receives rewards or punishments for his behavior.

You can compare the toddler with a reinforcement learning agent. In a reinforcement learning system, the agent interacts with the environment. The agent chooses an action and receives feedback from the environment in the form of states (or observations) and rewards. The rewards can be positive or negative, like the candy and the punishments. This cycle continues until the agent ends in a terminal state (the game is over). Then a new episode of learning starts. The goal of the agent is to maximize the sum of the rewards during an episode. Schematically, it looks like this:

Imagine you want to go to a restaurant tonight with a good friend. It’s your turn to choose a restaurant. Of course, you want a restaurant with good food (and good service, but for simplicity we only look at the food). Do you choose your favorite restaurant, because you know you love the food there? Or do you choose a new restaurant, because the food in this new place might be even better? By choosing a new restaurant, you also have a higher risk of a negative reward (terrible food).

This is an example of a multi-armed bandit problem. At every timestep, you have to make a choice (a.k.a. action), and you can decide to exploit your experience and choose something you are familiar with. But you can also explore new options and broaden your experience (and knowledge).

The name multi-armed bandit comes from casino slot machines: a one-armed bandit is a nickname for a single slot machine. With multiple slot machines, you want to know which lever to pull to get the maximum amount of money. You can decide to try a new slot machine (explore), or you can exploit your past experiences. Exploration and exploitation are important concepts for any reinforcement learning problem; more about that in the upcoming paragraphs.
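To make this concrete, here is a minimal sketch of such a bandit in Python. The class name SlotMachines and the payout probabilities are illustrative assumptions, not part of any particular library:

```python
import random

class SlotMachines:
    """Toy multi-armed bandit: each machine pays out 1 with its own hidden probability."""
    def __init__(self, probs):
        self.probs = probs                      # true success probability per machine

    def pull(self, machine):
        # Reward is 1 with the machine's probability, otherwise 0.
        return 1 if random.random() < self.probs[machine] else 0

# Three machines with arbitrary, hidden payout probabilities.
bandit = SlotMachines([0.16, 0.21, 0.65])
print(bandit.pull(2))                           # pull the lever of machine 2
```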

Besides choosing restaurants or slot machines, there are other real-life applications where multi-armed bandits are useful. They have proven their value in e.g. recommendation systems and anomaly detection. Below are some practical examples:

A/B testing is common in online advertising. During an A/B test, different ads are shown to people. The ad with a significantly higher number of clicks is considered the most successful, and the others are removed. A multi-armed bandit for selecting the best ad works as follows. The action is choosing one of the ads. In the beginning, the probabilities of selecting the ads are equal (exploration phase), but they shift in favour of ads with a higher click-through rate (exploitation phase). So a more successful ad will be shown more often and will eventually be the only one left. The advantage of this technique is that you can start exploiting your knowledge early in the process, like the chart below shows:

Warfarin is an anticoagulant medicine (it prevents blood clots). It’s hard to dose it correctly, because the appropriate dose can vary a lot between individuals. An incorrect initial dose is risky: it can cause strokes or internal bleeding. A (contextual) multi-armed bandit has helped here, by learning to assign a proper initial dose to patients.

Multi-armed bandits can also help with recommending the most interesting products to users. It’s a bit like A/B testing: you want to show users products they find interesting enough to buy, add to their shopping cart or click on. If a user clicks on a product, that counts as a positive reward: it’s a good product to show to this user. By applying this to similar customer groups, it’s possible to find the best items to show to a group.

There are many more applications. It’s even possible to use multi-armed bandits for better and/or faster machine learning! They are used in algorithm selection, hyperparameter optimization (as an improvement on Bayesian optimization) and clustering. This shows the versatility of the technique.

What are good ways to explore and exploit when dealing with a multi-armed bandit problem? There are different techniques. First, let’s look at some key concepts of multi-armed bandit problems (MABPs).

Action-Value Function & Regret

Let’s take a look at the slot machine example. If the probabilities of a reward of 1 for 3 machines are 0.16, 0.21 and 0.65, the best action is to play the machine with the highest probability, 0.65.

The multi-armed bandit agent doesn’t know these probabilities, so it first has to explore, until it discovers that it maximises the overall reward by only pulling the lever of slot machine 3 (probability 0.65). The value of an action is the expected reward, which for a 0/1 reward is simply the success probability of that action:

    q(a) = E[R | A = a] = 1 * p(a) + 0 * (1 - p(a)) = p(a)

So the action-values of the 3 slot machines are 0.16, 0.21 and 0.65.

Regret is the cumulative reward the multi-armed bandit misses out on because of exploration. The optimal expected cumulative reward for the slot machine example over 100 rounds would be 0.65 * 100 = 65 (only choosing the best machine). But during exploration, the multi-armed bandit also chooses the other machines. So the expected cumulative reward could be 0.65 * 60 + 0.16 * 15 + 0.21 * 25 = 46.65. In this case, the regret is equal to 65 - 46.65 = 18.35.
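As a quick sanity check, the regret calculation above takes only a few lines of Python (the pull counts 60/15/25 are just the example split from the text):

```python
probs = [0.16, 0.21, 0.65]                           # true success probabilities (action values)
pulls = [15, 25, 60]                                 # how often each machine was pulled in 100 rounds

optimal = max(probs) * sum(pulls)                    # always play the best machine: 0.65 * 100 = 65
expected = sum(p * n for p, n in zip(probs, pulls))  # expected reward of the actual pulls: 46.65
regret = optimal - expected                          # 18.35
print(optimal, expected, regret)
```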

Multi-Armed Bandit Agents

The goal of multi-armed bandit agents is to maximise the total reward for an episode. There are different ways agents can explore and exploit, with varying degrees of success. Below, I will discuss some of them.

Random and optimal agents
A not-so-smart agent would choose every action at random. Not so smart, because it only explores and never uses the knowledge it gathers.

The opposite would be an optimal agent, which chooses the optimal action every time. This isn’t realistic, because if we knew the probabilities of success for every bandit, we would always choose the one with the highest probability and there would be no need to implement an agent. Still, it is a nice baseline to compare how well a (realistic) agent performs. All agents described below perform somewhere between the random and the optimal agent.

Epsilon agents
Another way to handle exploration and exploitation is by choosing an ‘exploration value’, epsilon. If we choose epsilon to be equal to 0.2, around 20% of the actions will be chosen at random (exploration). For the other 80% of the actions, the agent chooses the best action so far (exploitation), i.e. the action with the highest average reward. This agent is called epsilon greedy.
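Here is a minimal sketch of an epsilon greedy agent, assuming the same three machines as before (an illustrative sketch with my own variable names, tracking a running average reward per action):

```python
import random

probs = [0.16, 0.21, 0.65]       # hidden success probabilities of the machines
epsilon = 0.2                    # fraction of actions chosen at random
counts = [0, 0, 0]               # how often each action has been chosen
values = [0.0, 0.0, 0.0]         # average reward per action so far

for t in range(1000):
    if random.random() < epsilon:                # explore: pick a random action
        action = random.randrange(len(probs))
    else:                                        # exploit: best average reward so far
        action = values.index(max(values))
    reward = 1 if random.random() < probs[action] else 0
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]   # incremental average

print(values)                    # estimated action values, roughly approaching 0.16, 0.21 and 0.65
```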

An improvement on the regular epsilon greedy agent is to explore more in the beginning and less as time goes by. This is easy to accomplish by adding a decay parameter, which gives us the epsilon greedy with decay agent. If this parameter is equal to 0.99, the value of epsilon is multiplied by 0.99 at every iteration. This makes sure there is still some exploration, but less and less as time passes, which makes sense: the agent gains knowledge, so there is less need to explore as its knowledge increases.

Because the epsilon agents described above exploit a lot, even in the beginning, another improvement is possible. By setting the initial value estimate of every action to the maximum possible reward (for example to 1 when the reward can be 0 or 1), the agent will prefer actions it hasn’t tried yet, because it still believes their reward is high. This agent is the optimistic epsilon greedy (with decay) agent. It prevents the agent from exploiting a not-so-good action early on.
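Both refinements require only small changes to the sketch above: multiply epsilon by a decay factor every step and start the value estimates at the maximum possible reward. A minimal sketch, assuming a decay factor of 0.99:

```python
import random

probs = [0.16, 0.21, 0.65]
epsilon, decay = 1.0, 0.99       # start fully exploring, then shrink epsilon every step
counts = [0, 0, 0]
values = [1.0, 1.0, 1.0]         # optimistic initial values: assume the maximum reward of 1

for t in range(1000):
    if random.random() < epsilon:
        action = random.randrange(len(probs))
    else:
        action = values.index(max(values))       # untried machines still look best (value 1.0)
    reward = 1 if random.random() < probs[action] else 0
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]
    epsilon *= decay                             # less and less exploration as time passes

print(values)
```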

Upper confidence bound agent
The upper confidence bound agent knows how to balance exploration and exploitation. For exploitation it uses the average reward of every bandit, just like the epsilon agents. An important difference, however, is how it handles exploration. It adds an exploration bonus that depends on the total number of time steps so far and the number of times a certain bandit has been chosen. For example, when we are at time step 100 and the agent has chosen an action only once, this bonus is high and makes sure the agent will choose that action. In this way, the upper confidence bound agent naturally explores actions it hasn’t tried often, while also exploiting its knowledge. The formula for action selection of the upper confidence bound agent is:

    action = argmax over a of [ Q(a) + c * sqrt( ln(t) / N(a) ) ]

where Q(a) is the average reward of action a so far, t is the current time step, N(a) is the number of times action a has been chosen, and c controls the amount of exploration.
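Here is a minimal sketch of this selection rule (an illustrative sketch; the exploration strength c = 1 is an assumption, and pulling every machine once first avoids a division by zero):

```python
import math
import random

probs = [0.16, 0.21, 0.65]
c = 1.0                          # exploration strength
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]

for t in range(1, 1001):
    if 0 in counts:                              # try every machine once first
        action = counts.index(0)
    else:
        scores = [values[a] + c * math.sqrt(math.log(t) / counts[a]) for a in range(len(probs))]
        action = scores.index(max(scores))       # highest upper confidence bound
    reward = 1 if random.random() < probs[action] else 0
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)
```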

Bernoulli Thompson agent
The final agent uses a completely different approach compared to the previous agents, although the intuition feels similar. The Bernoulli Thompson agent becomes more confident about the reward of a bandit the more often that bandit has been chosen. For every bandit, the agent tracks the number of successes α and the number of failures β and uses them to build a Beta(α, β) probability distribution for that bandit. At every step, the agent samples one value from each bandit’s distribution and chooses the bandit with the highest sampled value.
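Here is a minimal sketch of a Bernoulli Thompson agent, using Python’s built-in random.betavariate and, as an assumption, the true success probabilities 0.25, 0.5 and 0.1 from the example below:

```python
import random

probs = [0.25, 0.5, 0.1]         # true success probabilities of the three machines
alpha = [1, 1, 1]                # 1 + number of successes per machine (Beta prior)
beta = [1, 1, 1]                 # 1 + number of failures per machine (Beta prior)

for t in range(500):
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(len(probs))]
    action = samples.index(max(samples))         # choose the machine with the highest sample
    reward = 1 if random.random() < probs[action] else 0
    if reward == 1:
        alpha[action] += 1                       # one more success
    else:
        beta[action] += 1                        # one more failure

print([a / (a + b) for a, b in zip(alpha, beta)])   # posterior mean reward per machine
```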

To understand the Bernoulli Thompson agent a bit better, let’s look at the following bandits:

The agent starts with α equal to 1 and β equal to 1 for all bandits, which corresponds to a uniform Beta distribution (every bandit is equally likely to be chosen):

After 5 trials, the distributions, actions and rewards look like this:

Machines 0 and 1 have the same distribution: they have both been chosen twice, with one success and one failure each, while machine 2 has been chosen once with a reward of 0.

After 20 trials, machine 0 looks better than machine 1, if you only look at success percentage:

If we fast forward to 500 actions, the agent is pretty certain that machine 1 is the most rewarding one, and the mean values of the distributions are almost equal to the true success probabilities of the machines (0.25, 0.5 and 0.1):

Comparison of the agents

In the chart below, the agents are tested. This experiment can yield different results for a different number of trials and bandits. It’s clear that the optimistic epsilon agent with decay, the upper confidence bound agent and the Thompson agent outperform the others. This is exactly what we expect, because they handle exploration more intelligently than the other agents.

Python Code

You can find an implementation of the different multi-armed bandit agents on my GitHub. There is a test script (notebook file) that compares them and plots the results.

Downsides of a Multi-Armed Bandit

Basic multi-armed bandits choose between the same actions with the same reward probabilities all the time. A multi-armed bandit can’t handle changing environments: if the probabilities of the slot machines change (or your favourite restaurant gets a new cook), the multi-armed bandit needs to start learning from scratch (or you need to increase the exploration parameter).

When you have more than one state, you should switch to contextual bandits. Contextual bandits try to find the best action based on the circumstances, so they can work with different states.

Multi-armed bandits are one of the most basic ways to apply reinforcement learning. They are widely used in different fields: finance, advertising, health and also to improve machine learning. Multi-armed bandit agents can explore and exploit in many different ways. They can use randomised techniques, like the random and epsilon agents. Or they can use smarter ways, like the optimistic epsilon agent, the upper confidence bound agent and the Bernoulli Thompson agent. In the end, it’s all about discovering the best action and maximising the overall reward.

Which agent would you choose as a helper at the casino? 🤩

