How ‘MAB’ (Multi-Armed Bandit), a Reinforcement Learning Technique, Helps Solve the Ad Optimization Problem

Digital advertising agencies serve billions of ads across digital platforms, yet their primary concern remains the same: which ad will appeal most to viewers? An efficient solution can be found in reinforcement learning, a branch of artificial intelligence that has become well known for mastering board and video games.

Conventionally, A/B testing is the standard technique for comparing the performance of two competing solutions (A and B). When there are more than two alternatives, it is called A/B/n testing. In A/B/n testing, the subjects are randomly divided into separate groups, and each is provided with one of the available solutions.

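Digital ads usually have low conversion rates, so the differences between competing ads are small. The major problem with A/B/n testing is that it is not very efficient at detecting such minute differences: every ad, including the weak ones, keeps receiving its share of impressions until the test ends, which costs revenue, especially with a large ad catalog. Another problem with classic A/B/n testing is that it is static. Traffic is split in fixed proportions for the duration of the test, and the chosen winner is not revisited if viewer preferences later change.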

Reinforcement learning helps address these problems. A reinforcement learning agent begins knowing nothing about the actions available in its environment or the rewards and penalties they produce. It must then learn, through trial and error, how to maximize its rewards.

A “multi-armed bandit” (MAB) technique is used for ad optimization. It is a reinforcement learning formulation suited to single-step problems, where each action yields an immediate reward. Here, the agent must find the ad with the highest click-through rate (CTR) without squandering too many ad impressions on poorly performing ads.

The reinforcement learning agent must also balance exploiting the best-performing ad against exploring other options. A common way to do this is the “epsilon-greedy” (ε-greedy) algorithm: most of the time the model shows the ad with the best estimated performance, and in a specified fraction of cases (the epsilon factor) it selects one of the ads at random.
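The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the class name `EpsilonGreedyBandit`, the `simulate` helper, and the click-through rates used in the demo are all assumptions made for the example, and clicks are simulated with a random number generator instead of coming from real viewers.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy agent over a fixed set of ads (hypothetical helper)."""

    def __init__(self, n_ads, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.impressions = [0] * n_ads  # times each ad was shown
        self.clicks = [0] * n_ads       # clicks each ad received
        self.rng = random.Random(seed)

    def estimated_ctr(self, ad):
        """Observed click-through rate; 0.0 for an ad never shown."""
        shown = self.impressions[ad]
        return self.clicks[ad] / shown if shown else 0.0

    def select_ad(self):
        n_ads = len(self.impressions)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(n_ads)           # explore: random ad
        return max(range(n_ads), key=self.estimated_ctr)  # exploit: best so far

    def update(self, ad, clicked):
        self.impressions[ad] += 1
        self.clicks[ad] += int(clicked)


def simulate(true_ctrs, n_rounds=20000, epsilon=0.1, seed=0):
    """Run the agent against simulated viewers with assumed true CTRs."""
    bandit = EpsilonGreedyBandit(len(true_ctrs), epsilon, seed)
    viewers = random.Random(seed + 1)
    for _ in range(n_rounds):
        ad = bandit.select_ad()
        clicked = viewers.random() < true_ctrs[ad]
        bandit.update(ad, clicked)
    return bandit


if __name__ == "__main__":
    # Illustrative CTRs only; real ads would be measured, not assumed.
    result = simulate([0.02, 0.03, 0.05])
    for ad, shown in enumerate(result.impressions):
        print(ad, shown, round(result.estimated_ctr(ad), 4))
```

In a run like this, the agent is expected to route most impressions to whichever ad it currently estimates as best, while the epsilon share of random picks keeps the other estimates from going stale.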

An essential aspect of the ε-greedy reinforcement learning algorithm is tuning the epsilon factor. If it is too low, the agent will keep exploiting the ad it currently believes is optimal and may never discover a better one. If it is too high, the agent will waste too many impressions exploring non-optimal ads.

One way to improve the epsilon-greedy algorithm is to define a dynamic policy. When the MAB model is fresh, we can start with a high epsilon value. As the model serves more ads and gets a better estimate of each ad’s value, it can gradually reduce epsilon until it reaches a minimum threshold. Another improvement is to put more weight on new observations and progressively discount older ones. This is especially helpful in dynamic environments such as digital ads and product recommendations, where the value of each option can change over time.
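A decaying epsilon of this kind might look like the sketch below. The function name `decayed_epsilon`, the exponential form of the decay, and the default values are illustrative assumptions; the only requirement taken from the text is that epsilon starts high and shrinks toward a threshold.

```python
def decayed_epsilon(step, start=1.0, floor=0.05, decay=0.999):
    """Exploration rate after `step` served ads.

    Starts at `start`, shrinks by a factor of `decay` per step,
    and never drops below `floor` (all values are assumed defaults).
    """
    return max(floor, start * decay ** step)


if __name__ == "__main__":
    # Print the exploration rate at a few points in the campaign.
    for step in (0, 500, 2000, 10000):
        print(step, decayed_epsilon(step))
```

The floor keeps a small amount of exploration alive indefinitely, so the agent can still notice when a previously weak ad starts to perform better.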

If we have two competing ads with an equal number of clicks and impressions, the model will favor the one whose clicks are more recent. Similarly, if an ad had a very high CTR in the past but has recently stopped attracting clicks, its estimated value will decline faster under this scheme, pushing the RL model toward other alternatives sooner and wasting fewer impressions on the now-inefficient ad.
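One common way to weight recent observations more heavily is an exponential recency-weighted average, where each new outcome pulls the estimate toward itself by a fixed step size. The sketch below assumes that approach; the function name `recency_weighted_value`, the step size `alpha`, and the two click histories are made up for illustration.

```python
def recency_weighted_value(outcomes, alpha=0.1, initial=0.0):
    """Exponentially weighted estimate of an ad's value.

    `outcomes` is a chronological list of 1 (click) or 0 (no click).
    Each observation moves the estimate a fraction `alpha` of the way
    toward itself, so older observations fade geometrically.
    """
    value = initial
    for outcome in outcomes:
        value += alpha * (outcome - value)
    return value


# Two hypothetical ads, each with 5 clicks in 20 impressions.
early_clicks = [1] * 5 + [0] * 15   # clicks came first, then went quiet
recent_clicks = [0] * 15 + [1] * 5  # clicks arrived most recently

if __name__ == "__main__":
    print(sum(early_clicks) / len(early_clicks))    # plain CTR of the first ad
    print(sum(recent_clicks) / len(recent_clicks))  # plain CTR of the second ad
    print(recency_weighted_value(early_clicks))
    print(recency_weighted_value(recent_clicks))
```

Both ads have the same plain CTR of 0.25, but with `alpha=0.1` the recency-weighted estimate comes out at roughly 0.41 for the ad with recent clicks and roughly 0.08 for the ad that has gone quiet, which is the behavior described above.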

The rich information available on the internet gives companies opportunities to personalize ads for each viewer. A plain multi-armed bandit, however, learns a single ranking of ads for everyone and doesn’t take each viewer’s specific characteristics into account. One solution is to create a separate multi-armed bandit for each viewer segment, but as the number of segments grows, these become difficult to train and maintain.

An alternative is to use a “contextual bandit,” an extension of the multi-armed bandit that takes contextual information into account. Instead of creating a separate MAB for each combination of characteristics, the contextual bandit uses “function approximation,” which models each ad’s performance as a function of a set of input factors. It uses supervised machine learning to predict each ad’s performance from features such as the viewer’s location, device type, gender, and age.
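As a rough sketch of the idea, the example below keeps one small online logistic-regression model per ad and picks ads epsilon-greedily on the predicted CTR. Logistic regression, the class name `ContextualEpsilonGreedy`, the learning rate, and the two-element context vector are assumptions chosen to keep the example short; a real system could use any supervised model and a much richer feature set.

```python
import math
import random


class ContextualEpsilonGreedy:
    """One online logistic-regression model per ad (hypothetical helper)."""

    def __init__(self, n_ads, n_features, epsilon=0.1, lr=0.05, seed=None):
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        # One weight vector per ad; the last entry is a bias term.
        self.weights = [[0.0] * (n_features + 1) for _ in range(n_ads)]

    def predict_ctr(self, ad, context):
        """Predicted click probability of `ad` for this viewer context."""
        w = self.weights[ad]
        z = w[-1] + sum(wi * xi for wi, xi in zip(w, context))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def select_ad(self, context):
        n_ads = len(self.weights)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(n_ads)  # explore
        # Exploit: the ad with the highest predicted CTR for this viewer.
        return max(range(n_ads), key=lambda ad: self.predict_ctr(ad, context))

    def update(self, ad, context, clicked):
        """One stochastic-gradient step on the log loss for the shown ad."""
        error = float(clicked) - self.predict_ctr(ad, context)
        w = self.weights[ad]
        for i, xi in enumerate(context):
            w[i] += self.lr * error * xi
        w[-1] += self.lr * error


if __name__ == "__main__":
    # Assumed context encoding: [is_mobile, is_desktop].
    agent = ContextualEpsilonGreedy(n_ads=2, n_features=2, epsilon=0.0)
    for _ in range(200):
        agent.update(0, [1, 0], True)   # ad 0 gets clicks on mobile
        agent.update(0, [0, 1], False)
        agent.update(1, [0, 1], True)   # ad 1 gets clicks on desktop
        agent.update(1, [1, 0], False)
    print(agent.select_ad([1, 0]), agent.select_ad([0, 1]))
```

Because the models share structure across contexts, a single agent can serve different ads to different viewers without maintaining a separate bandit for every combination of characteristics.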

Reinforcement learning techniques can solve a variety of other problems, such as content and product recommendation and dynamic pricing. They can also be used in domains such as health care, investment, and network management.