Google AI’s Latest Research on Language Model Proposes Two Different Sequence-To-Sequence Approaches Toward Zero-Shot Transfer For Dialogue Modeling

This article summary is based on the research papers: 'Description-Driven Task-Oriented Dialog Modeling' and 'Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue'. All credit goes to the research team of these papers.

Please don't forget to join our ML Subreddit

Need help in creating ML Research content for your lab/startup? Talk to us

Conversational agents are nothing but computer systems intended to converse with humans. These agents have a wide range of applications, including booking flights, finding restaurants, playing music, and telling jokes. It can analyze and reply in ordinary natural languages. However, adding this functionality might be cumbersome as each new assignment necessitates new data collection and retraining of the conversational agent’s models.

Most task-oriented dialogue (TOD) models are trained on a particular task-specific ontology, which poses a significant problem. Ontology is just a catalog of conceivable user intentions or the activities that the conversational agent must complete and the available attribute fields to retrieve from any given interaction. A tight ontology can limit the model’s ability to generalize to new tasks or contexts. Consider the case where an agent already knows how to buy train tickets; adding the ability to order airline tickets would necessitate new data training. In an ideal world, the agent would be able to apply existing information from one ontology to new ones.

By condensing each conceptual framework into a schema of slots and intents, the Schema Guided Dialogue (SGD) dataset has been constructed to examine the ability to generalize to unseen problems. TOD models are trained on several schemas in the SGD environment and assessed on how effectively they generalize to new ones rather than how effectively they overfit to a particular ontology. Recent research, however, indicates that the top models still have space for development.

Google AI has devised two different sequence-to-sequence techniques to zero-shot transfer for speech modeling to overcome this issue. The findings of several dialogue state monitoring standards reveal that by removing preset schemas and ontologies, these innovative techniques produce state-of-the-art outcomes with more efficient models for the dialogue state tracking job.

What is dialogue state tracking?

DST is a core flaw for conversational agents, where a model predicts a conversation’s belief state. The belief state is usually represented as a set of values assigned to slots where the user has expressed a preference throughout the conversation. The agent expresses its opinion on the user’s preferences.

  • In one of their papers, Google AI presents Dialogue State Tracking Based on Descriptions (D3ST). While making assumptions about the belief state, this DST model uses slot and intent descriptions. The T5 sequence-to-sequence language paradigm provides the foundation for D3ST.
  • The T5 model can attend to both the contextual information and the discourse since D3ST triggers the input data with slots and intent descriptions. The articulation of these descriptions gives it the power to generalize. Instead of assigning each slot a name, the aim is to assign each slot a random index. The belief state and user intent are the goal outputs that are again characterized by their assigned indices. Intents are treated similarly, and these descriptions are combined to generate the schema representation included in the input sequence. This is supplied into the T5 model along with the discussion transcript.
  • Instead of depending on slot definitions, one of their other papers recommends using a single annotated dialogue sample to explain the various slots and values in a conversation. The purpose is to “display” rather than “explain” the schema’s semantics through descriptions. Hence the term “Show Don’t Tell” (SDT). SDT is based on T5, and it outperforms D3ST in terms of zero-shot performance.

Both models meet or surpass current framework baselines (T5DST and paDST). In general, SDT outperforms D3ST by a small margin. Furthermore, while baseline models can only predict one slot per forwarding pass, the suggested models can decode the entire dialogue state in a single forward pass, making training and inference substantially more efficient.

In the MultiWOZ cross-domain leave-one-out configuration, both D3ST and SDT show state-of-the-art performance. Additionally, the SGD-X dataset was used to test D3ST and SDT, and both showed strong resistance to linguistic differences in the schema. These tests show that D3ST and SDT are cutting-edge TOD models with the ability to represent new tasks and domains.


The articles presented by Google AI show that a zero-shot TOD system can generalize to previously unknown tasks or domains. On the other hand, the research team has focused on the DST issue. They hope to expand this study to include zero-shot conversation policy modeling, allowing TOD systems to respond to arbitrary instructions. Furthermore, the existing input format frequently results in large input sequences, which might make inference slow.

Paper 1:

Paper 2:


🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...