Following Reinforcement Learning Methods in Telecom Networks

This summary is based on the Ericsson research article ‘Bringing reinforcement learning solutions to action in telecom networks’.

Reinforcement learning (RL) has shown promise in creating complex logic in controlled settings. But what are the prospects for using RL in a more complicated setting such as telecom networks? Let’s cover the basics first.

What is reinforcement learning, and how does it work?

Machine learning has three main paradigms: reinforcement learning (RL), supervised learning, and unsupervised learning. An RL agent makes decisions and learns through repeated interaction with a target environment, whereas supervised and unsupervised learning train a model from a dataset.

The target environment for RL can be a dataset of previous interactions, a simulation of a system, or even the real system. Even without a model of the environment, an RL agent can learn complex behaviors in the target environment and make decisions that optimize a long-term goal.


Basic principles of reinforcement learning

The essential ideas of RL are depicted in the figure above. At each time step (t), an agent uses its policy to choose an action (At) based on the current state (St). The environment carries out the action and returns the reward (Rt+1) and the next state (St+1). The agent then updates its policy to increase the accumulated reward, using the (state, action, reward, next state) tuples collected in previous steps.
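The loop described above can be sketched in a few lines. The following is a minimal illustration, assuming a hypothetical five-cell corridor environment and tabular Q-learning as the policy update; the environment, the hyperparameters, and all names are assumptions chosen for illustration and do not come from the Ericsson article.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, reward 1.0 on reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def choose_action(q_row, epsilon, rng):
    # Epsilon-greedy policy: explore sometimes, otherwise pick a best-valued action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, value in enumerate(q_row) if value == best])


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):
        state = env.reset()
        for _ in range(200):  # cap on steps per episode
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = env.step(action)
            # Policy update from the (state, action, reward, next state) tuple.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy action per non-terminal cell after training (1 means "move right").
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(4)]
print(greedy_policy)
```

A real telecom environment would replace the corridor with a simulator, an emulator, or logged data, and the table with a function approximator, but the interaction loop keeps the same shape.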

While RL was not at the forefront of the AI revolution in its early years, it is now receiving attention in industry, especially as the need for autonomous agents to manage complex systems grows. Many industries, such as automotive, agriculture, and telecommunications, have begun to invest heavily in autonomous solutions, primarily based on RL, to optimize and automate their production processes.

Because game environments were widely available and an agent could be trained in them at low cost, the first successful RL deployments were mainly for playing games. As the approaches progressed, real-world examples began to appear. In web design and marketing, for example, RL has been used to determine which content offered to the user creates the most value. It is also being used in clinical trials to assess different medications quickly.

For the America’s Cup sailing competition, a team used RL to train an agent to sail in a sailing simulator under various conditions. The agent reduced the need for sailors to take part in the boat design process, which was highly beneficial given their time constraints.

Coordination among autonomous machines in dynamic settings is typically complex, and RL agents can help meet the required control cycle times while working with restricted resources.

Many obstacles must still be overcome before RL can be applied to complicated systems like mobile networks. Substantial efforts are now under way to address them, and the main challenges are discussed below.

5G enables ultra-reliable, low-latency, and high-bandwidth network applications with a wide range of uses and requirements. As the number of mobile communication devices grows, more control methods for autonomous operation are required. Managing such a system covers everything from setting parameters for individual network operations to defining goals and constraints for the mobile communication system to follow.

The ultimate goal is a zero-touch system in which people keep control over and monitor automated operations with minimal intervention. To that end, engineers in the telecom industry want to deploy intelligent agents throughout the network, each of which can learn from data and collaborate with other agents. Combined with other applicable AI techniques, autonomous coordination of agents across network domains can bring us closer to zero-touch management. In 5G networks, such agents may be heterogeneous in their capabilities, and they may operate in dynamic settings where new agents enter the system and current ones leave.

Radio systems are dynamically configured to meet operational needs. Autonomous configuration of antenna tilt and power in a commercial district, for example, might achieve the highest feasible coverage or quality based on current movement patterns.

Deploying interacting closed loops makes closed-loop operation possible in 5G environments, with dynamic orchestration across the network’s sub-network domains and orchestration layers.

Sim2real, latent space representations, safe RL, multi-agent RL, and offline RL are active research areas. They can enable RL systems that are trained on offline data or a digital twin and then deployed in real-world scenarios with acceptable confidence.

In parallel, the deployment of interacting closed loops, enabled by the principles discussed above, will help service providers realize large benefits by automating complicated systems while still achieving significant cost savings. With so many RL approaches making their way into products, the telecom sector could soon see an autonomous network powered by intelligent agents.

Recent deployments of RL-based solutions have enhanced downlink user throughput by 12 percent in one operator’s network and reduced cell downlink transmission power by 20 percent in another operator’s network. Simulators and emulators were utilized for training the agents to develop optimal policies before they were transferred to a live network.


Each of these control loops represents a function with its own cycle time limits, and each can work alone or together with other control loops. The figure above depicts the three domains of the telecom system: operations, administration, and maintenance (OAM); radio access network (RAN); and core network (Core).

The top-level OAM domain manages the network’s physical and virtual resources. The network operator optimizes resource consumption to meet predefined target service or business goals.

The Core’s job is to link users to the network reliably and securely. Connectivity, mobility management, authentication and authorization, subscriber data management, and policy administration are all part of the Core.

The RAN balances the radio resources available in the operator’s assigned spectrum against the services requested by user equipment (UE) and the UE’s geographic location. To manage this effectively, the RAN includes functions for medium access control, radio link control, radio resource control, and mobility management.

The diagram above depicts the overall structure of the significant network domains and their primary functions, but it does not show how they are implemented.

As the number of wireless networks grows, the system’s control methods must become increasingly sophisticated. Several resource control loops in the 5G RAN monitor current conditions, such as channel information, and control radio resources through functions such as link adaptation, MIMO beamforming, scheduling, and spectrum sharing.

In addition, dynamic UE environments and sites with overlapping coverage produce many interdependencies. As a result, numerous KPIs must be optimized, yet episodes must be short enough to meet strict cycle times. Furthermore, because the dynamics of the environment are constantly changing, agents need to keep learning and adapting in a cost-effective way.

Compared to traditional domain expert design, data-driven control, such as RL, allows for the use of more sophisticated and real-time dynamic data.

RL is projected to play a critical role in 5G and future 6G networks, particularly in efficiently managing KPIs and expectations across the RAN, Core, and Orchestration/OAM domains. At Ericsson, RL techniques are being developed to help tackle complicated problems autonomously.

RL with several agents

Multi-Agent RL is a strategy in which several agents learn to simultaneously coordinate and act in the environment to achieve a common goal and/or individual targets. This technique is being investigated to resolve conflicts and promote collaboration among various agents in a telecom network, improving system performance and efficiency.

Non-stationarity is a problem in multi-agent RL. It arises because the next state an agent observes is determined by the joint action of all agents rather than by the action and state of a single agent. The goal is to identify the small region of the joint action space where collaboration and coordination are most effective. The WHIRL lab at Oxford and the Google Brain and DeepMind teams at Google, for example, have done some exciting research in this area.


Multiple control loops must run at the same time in telecom networks, for two main reasons. First, numerous users with various service requests must be served simultaneously. Second, the delivery of a single service may require many control loops.

These control loops may need to work together in a hierarchical setting to provide the service.

What happens when they conflict?

Resolving such conflicts is the most challenging task. The efficiency of a network will depend in part on how well numerous competing control loops can be coordinated with optimal resource allocation.

Credit assignment is crucial: to what extent is each agent responsible for the outcome? Because agents act at different intervals, a joint action may need to be applied asynchronously. Rewards come at multiple scales, and the time window for attributing rewards varies by agent, making it difficult to standardize the rewards.
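One common way to frame the credit assignment question is counterfactually: compare the shared reward with what it would have been had a single agent taken a default action instead. The sketch below illustrates that idea with simple difference rewards. It is not an implementation of COMA or of any method from the Ericsson article, and the reward function (three neighbouring cells choosing a power level from 0 to 2, with a penalty when two neighbours both transmit at full power) is a made-up assumption.

```python
def global_reward(actions):
    """Hypothetical shared reward: higher power levels add throughput,
    but two neighbouring cells at full power (level 2) interfere."""
    throughput = sum(actions)
    interference = sum(
        1.5 for left, right in zip(actions, actions[1:]) if left == 2 and right == 2
    )
    return throughput - interference


def difference_rewards(actions, default_action=0):
    """Credit for agent i = shared reward minus the shared reward obtained
    when only agent i's action is replaced by a default action."""
    base = global_reward(actions)
    credits = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = default_action
        credits.append(base - global_reward(counterfactual))
    return credits


joint_action = [2, 2, 1]
credits = difference_rewards(joint_action)
print(credits)
```

In this toy case the two full-power cells each receive less credit than their raw power level suggests, because part of their contribution is cancelled by the interference they cause together.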

In light of the issues above, state-of-the-art approaches like QMIX, COMA, and TD3 are being investigated and extended/modified specifically for OAM, Service Prioritization, and RAN challenges.


The figure above illustrates how people learn multiple activities by reusing skills, such as transferring walking skills to trekking or balancing on skis to balancing on skates. So the question is: how can we transfer this human ability to algorithms, allowing models to learn and perform various tasks?

Because a single model is trained to perform numerous tasks, it requires more data to cover the task dimensions, which can make exploration costly. One way to deal with this problem is for the model to learn as much from a failed attempt as it does from a successful one, an approach known as Hindsight Experience Replay. Another method, learning from latent state representations, has proven particularly useful because it allows the model to capture the problem’s intrinsic structure.

A basic example of the problem is an agent learning to travel from one maze coordinate to another, where neither the starting point nor the destination is fixed. Rather than merely observing the start and finish points, the agent must understand the grid’s underlying structure through a latent representation of the state space and goal. A latent model could be built with a simple function that takes the current and goal states as parameters, or with autoencoders when the state requires a visual representation.
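The core trick of Hindsight Experience Replay is to store each failed episode a second time, relabelled as if the state the agent actually reached had been the goal all along. Below is a minimal sketch of that relabelling step, assuming goal-conditioned transitions stored as plain tuples, a sparse reward of 0 on reaching the goal and -1 otherwise, and the goal taken from the final state of the episode; the tuple layout and function names are illustrative assumptions.

```python
def sparse_reward(achieved, goal):
    # 0 when the goal is reached, -1 on every other step.
    return 0.0 if achieved == goal else -1.0


def relabel_with_hindsight(episode):
    """episode: list of (state, action, next_state, goal) tuples.

    Returns replay entries (state, action, reward, next_state, goal): each
    original transition, followed by a copy whose goal is replaced by the
    state in which the episode actually ended."""
    achieved_goal = episode[-1][2]
    replay = []
    for state, action, next_state, goal in episode:
        replay.append(
            (state, action, sparse_reward(next_state, goal), next_state, goal)
        )
        replay.append(
            (state, action, sparse_reward(next_state, achieved_goal),
             next_state, achieved_goal)
        )
    return replay


# A short episode that never reaches its intended goal (3, 3).
episode = [
    ((0, 0), "right", (0, 1), (3, 3)),
    ((0, 1), "down", (1, 1), (3, 3)),
]
replay = relabel_with_hindsight(episode)
for entry in replay:
    print(entry)
```

The relabelled copies contain a successful transition even though the original episode failed, which gives a goal-conditioned learner a useful signal from an otherwise uninformative experience.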

Transferring experience from one task to another is also a challenge in multi-task RL. When the tasks are highly dissimilar, their updates can produce opposing gradients, which negatively impacts learning, so this type of negative knowledge transfer must be avoided. Furthermore, because the algorithm learns numerous tasks together, there is a risk of ‘catastrophic forgetting,’ which occurs when an agent forgets what it learned previously while learning a new task.

The scientific literature offers methods that address both of these issues, for example in the articles ‘Robust Multitask Reinforcement Learning’ and ‘Gradient Normalization for Adaptive Loss Balancing,’ but this is still a research area that needs further exploration.

Often, learning a new activity does not mean starting from scratch but rather combining several previously learned skills into the new one. Meta RL can help a model adapt to a new task with a small amount of data by using knowledge learned from previous tasks. Meta-learning has shown promise as an approach for RL.
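A simple way to see what opposing gradients means is to compute the gradient of each task’s loss with respect to the shared parameters and check the angle between them. The sketch below does this for two hypothetical quadratic task losses; it is only a diagnostic illustration under those assumptions, not the method of any of the articles mentioned here.

```python
import math


def task_gradient(weights, target):
    # Gradient of the hypothetical per-task loss sum((w - t)^2) w.r.t. w.
    return [2.0 * (w - t) for w, t in zip(weights, target)]


def cosine_similarity(g1, g2):
    dot = sum(a * b for a, b in zip(g1, g2))
    norm = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return dot / norm if norm > 0 else 0.0


# Two tasks that pull the shared weights in opposite directions.
weights = [0.0, 0.0]
grad_a = task_gradient(weights, [1.0, 0.0])
grad_b = task_gradient(weights, [-1.0, 0.0])
similarity = cosine_similarity(grad_a, grad_b)
conflict = similarity < 0  # negative cosine: the task gradients oppose each other
print(similarity, conflict)
```

When the cosine is negative, a step that helps one task hurts the other, which is the situation that multi-task methods try to detect or balance.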

We’re working on a cloud-native, simulator-agnostic, cross-platform framework for large-scale distributed RL training that’s flexible and extendable.

The fundamental problem is that simulators/emulators are written in various programming languages and have varying messaging needs.

Safe RL

Deploying RL agents in real-world telecom use cases requires a high level of safety. How can we ensure or guarantee safety? What kind of infrastructure is needed? This work covers safety criteria, safe RL algorithms, and the corresponding infrastructure. A safety shield sits between the real world and the RL agents.

The safety shield evaluates the safety of each action proposed by an RL agent using its shield logic and then decides whether to carry out that action. Two separate safety shield logics have been developed. One is based on symbolic logic extracted from domain knowledge, and the other is based on an existing rule-based solution whose safety has been proven. A safety shield defines an unsafe region of the action space, and actions in that region are disallowed in the real world.

Sample efficiency is key to maximizing RL performance when real-world data is scarce. This is common in the telecom industry, where random exploration is prohibited. A variety of strategies can be used. Latent space training compresses the environment’s spatial and temporal representation. Experience transfer lets a model kickstart its learning from a few sample trajectories taken from a base model’s previous learning. Recent strategies for sample efficiency in RL training include few-shot learning, exploiting meta-learning, and neuromorphic computing.
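The shield pattern can be sketched as a thin wrapper that checks every proposed action before it reaches the network. The example below is an illustration only: the rule (antenna tilt must stay between 0 and 10 degrees and change by at most 2 degrees per step), the fallback of leaving the tilt unchanged, and all names are assumptions, not the shield logic developed at Ericsson.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class SafetyShield:
    """Sits between the agent and the environment and vetoes unsafe actions."""

    is_safe: Callable[[float, float], bool]  # (state, action) -> allowed?
    fallback: Callable[[float], float]       # state -> known-safe action

    def filter(self, state: float, proposed_action: float) -> Tuple[float, bool]:
        # Returns the action to execute and whether the shield intervened.
        if self.is_safe(state, proposed_action):
            return proposed_action, False
        return self.fallback(state), True


def tilt_rule(current_tilt: float, tilt_change: float) -> bool:
    # Hypothetical domain rule standing in for the shield logic.
    new_tilt = current_tilt + tilt_change
    return abs(tilt_change) <= 2.0 and 0.0 <= new_tilt <= 10.0


shield = SafetyShield(is_safe=tilt_rule, fallback=lambda current_tilt: 0.0)

allowed = shield.filter(5.0, 1.0)  # small change inside the limits
blocked = shield.filter(9.0, 2.0)  # would push the tilt past 10 degrees
print(allowed, blocked)
```

Returning the intervention flag alongside the action lets the training loop log or penalize blocked proposals, so the agent can learn to stay out of the unsafe region.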

Reinforcement learning has proven to be a feasible technology, with applications in games and less complex commercial settings. New methodologies and techniques are being developed at Ericsson Research to overcome the challenges of implementing reinforcement learning at scale in future networks. Collaboration with academia and participation in open industry forums are essential elements of this work.

While some in the AI community believe that the growth of reinforcement learning will lead to generalized intelligence, where ‘Reward is Enough’ to drive intelligent behavior, we are still a long way off. The journey will continue, and it promises to be thrilling.

