In recent years, voice-based virtual assistants such as Google Assistant and Amazon Alexa have grown popular, presenting both opportunities and challenges for natural language understanding (NLU) systems. The production systems behind these devices are typically trained with supervised learning and depend heavily on annotated data. But data annotation is costly and time-consuming, and model updates via offline supervised learning can take a long time and miss trending requests.
In the underlying architecture of voice-based virtual assistants, the NLU model categorizes user requests into hypotheses for downstream applications to fulfill. A hypothesis comprises two kinds of labels: a user intention (intent) and named-entity recognition (NER) slots. For example, the valid hypothesis for “play a Madonna song” is the PlaySong intent with the NER slot ArtistName – Madonna.
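The hypothesis structure above can be sketched with a minimal data type (the class and field names here are illustrative, not the paper's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an NLU hypothesis: an intent label plus
# named-entity (NER) slots, as in the "play a Madonna song" example.
@dataclass
class Hypothesis:
    intent: str
    slots: dict = field(default_factory=dict)

hyp = Hypothesis(intent="PlaySong", slots={"ArtistName": "Madonna"})
```

Downstream applications would consume the selected hypothesis to fulfill the request, e.g. the Music application playing a song by the artist in `ArtistName`.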
A new Amazon research study introduces deep reinforcement learning strategies for NLU ranking. The work analyzes a ranking problem in an NLU system in which entirely independent domain experts generate hypotheses along with their features, where a domain is a functional area such as Music, Shopping, or Weather. These hypotheses are then ranked by scores computed from those features, so the ranker must calibrate features across domain experts and select one hypothesis according to its policy.
They first capture linear relationships between hypothesis features and scores using the standard contextual multi-armed bandit technique, then generalize to nonlinear relationships with policy gradient and Q-learning algorithms.
They frame the problem as a contextual multi-armed bandit over hypotheses, with arms representing the scores produced by domain experts, known as domain rankers. Because the arms are independent, the features for hypotheses in one domain are not accessible to other domain rankers. Furthermore, each domain ranker produces scores for several hypotheses simultaneously. The setup therefore calls for a novel reinforcement learning approach with partially observed contexts and multi-action estimation.
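The bandit-style setup can be illustrated with a simplified sketch (the linear scorer and epsilon-greedy exploration here are common bandit choices, assumed for illustration rather than taken from the paper): each domain ranker sees only its own hypotheses' features, scores all of them at once, and the global ranker picks the top-scoring candidate while occasionally exploring.

```python
import random

# Hypothetical sketch of the ranking setup: each domain ranker scores
# only its own hypotheses (partially observed context), and the global
# ranker selects one hypothesis, with epsilon-greedy exploration.
def linear_score(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def rank(domain_hypotheses, domain_weights, epsilon=0.1, rng=random):
    # domain_hypotheses: {domain: [(hypothesis_id, feature_vector), ...]}
    scored = [
        (linear_score(domain_weights[d], feats), d, hyp_id)
        for d, hyps in domain_hypotheses.items()
        for hyp_id, feats in hyps
    ]
    if rng.random() < epsilon:           # explore: pick a random arm
        return rng.choice(scored)[1:]
    return max(scored)[1:]               # exploit: pick the best score
```

With `epsilon=0` the selection is purely greedy; raising it trades off exploitation for exploration of less-favored hypotheses.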
The researchers draw on recent advances in deep reinforcement learning to provide an alternative online training technique. In the policy gradient method, the sampled return is used to estimate the value of the state-action pair.
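A minimal REINFORCE-style sketch of this idea, assuming a softmax policy over per-hypothesis linear scores and a single-step reward standing in for the sampled return (the function and parameter names are illustrative):

```python
import math

# Hypothetical REINFORCE-style update: the sampled return (here a
# single-step reward) estimates the value of the chosen action, and
# the log-probability gradient nudges the softmax policy toward
# actions that earned positive reward.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(weights, features, chosen, reward, lr=0.1):
    # scores[i] = weights[i] . features[i] for each candidate hypothesis i
    scores = [sum(w * x for w, x in zip(weights[i], features[i]))
              for i in range(len(weights))]
    probs = softmax(scores)
    for i in range(len(weights)):
        # d log pi(chosen) / d score_i = 1{i == chosen} - probs[i]
        grad_score = (1.0 if i == chosen else 0.0) - probs[i]
        weights[i] = [w + lr * reward * grad_score * x
                      for w, x in zip(weights[i], features[i])]
    return weights
```

A positive reward raises the chosen hypothesis's score relative to the alternatives; a negative reward lowers it.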
The proposed method relies on implicit user feedback signals to calculate the rewards associated with samples in logged data, rather than relying on human annotation. If the chosen NLU hypothesis meets the customer’s needs, a positive reward is given; if it does not, a negative reward is given. Because the training method allows for automatic model updates at runtime, this approach can quickly catch up on trending requests.
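This reward scheme can be sketched as a simple mapping from logged interaction signals to +1/-1 rewards; the specific signal names below (barge-in, rephrase, cancel) are illustrative examples of friction, not the paper's exact feature set:

```python
# Hypothetical sketch of reward assignment from implicit user feedback
# in logged data: signs of friction (the user interrupting, rephrasing,
# or canceling the request) earn a negative reward; otherwise the
# interaction is treated as successful and earns a positive reward.
def reward_from_feedback(interaction):
    friction_signals = ("barge_in", "rephrase", "cancel")
    if any(interaction.get(sig) for sig in friction_signals):
        return -1.0
    return 1.0
```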
The researchers use two sets of data to present the findings of a simulation experiment, as below:
- Annotated data, where the expected response is the ground truth, decided by human annotators
- Unannotated logged historical data, where the ground truth is unknown and the expected response (treated as ground truth) is determined by implicit user feedback
The researchers report their findings in these trials using two metrics: the exact-match hypothesis error rate (IRER) and the intent classification error rate (ICER).
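Assuming each prediction and reference is an (intent, slots) pair, the two metrics can be computed as follows (a sketch of the metric definitions as described, not the paper's evaluation code):

```python
# ICER counts intent mismatches; IRER counts hypotheses that fail to
# match the reference exactly (intent and slots together).
def error_rates(predictions, references):
    n = len(references)
    icer = sum(p[0] != r[0] for p, r in zip(predictions, references)) / n
    irer = sum(p != r for p, r in zip(predictions, references)) / n
    return icer, irer
```

By construction IRER is at least as large as ICER, since any intent error also breaks the exact match.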
The results reveal that the new technique reduces the exact-match hypothesis and intent classification error rates by 4% and 18%, respectively, compared to the baseline models. The team plans to use user friction in a more sophisticated way by accounting for the uncertainty in friction estimation during reward computation. They also aim to improve exploration during online learning, for example by using look-ahead procedures.