Google AI Introduces Two New Datasets, ‘TimeDial’ and ‘Disfl-QA’, For Conversational NLP (Natural Language Processing)

Natural language processing (NLP) has made significant advancements in recent years, with applications in learning, comprehending, and generating human language content. However, one of the greatest challenges in NLP is designing conversational bots that can understand and reason about distinct linguistic phenomena specific to natural speech.

People do not always plan out precisely what they will say, and disfluencies, or interruptions in speech, are common in spontaneous conversations. Simple disfluencies (such as interjections, repetitions, restarts, or corrections) interrupt the flow of a sentence, while more complicated semantic disfluencies modify the underlying meaning of a phrase. Furthermore, understanding a conversation frequently requires an awareness of temporal linkages and relationships between events, such as whether one incident precedes or follows another. 

However, conversational agents built on today’s NLP models often fail when presented with temporal linkages or disfluencies. Many studies have attempted to improve conversational agent performance, but progress has been slow, partly owing to a scarcity of datasets that cover these conversational and speech phenomena.

To address these issues, new research from Google introduces TimeDial and Disfl-QA. TimeDial focuses on temporal commonsense reasoning in dialogue, and Disfl-QA focuses on contextual disfluencies. These are the first benchmark datasets of their kind, and they reveal a significant gap between human performance and current state-of-the-art NLP models.


With an annotated test set of over 1.1k dialogues, TimeDial introduces a novel multiple-choice span-filling task for temporal knowledge. The dataset is derived from the DailyDialog multi-turn dialogue corpus and measures models’ temporal commonsense reasoning abilities within a dialogue context. Each dialogue is presented in a multiple-choice format, with one temporal span masked out. The model is asked to select all valid responses from a set of four options to fill in the gap.
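
The task format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TimeDialItem` class, its field names, the `<MASK>` placeholder, and the example dialogue are all hypothetical and are not taken from the released dataset, and the strict set-equality score is just one possible way to grade "all valid responses".

```python
from __future__ import annotations

from dataclasses import dataclass

MASK = "<MASK>"  # hypothetical placeholder for the hidden temporal span


@dataclass
class TimeDialItem:
    """Hypothetical container for one multiple-choice span-filling instance."""
    dialogue: list[str]   # dialogue turns; exactly one turn contains MASK
    options: list[str]    # the four candidate spans
    correct: set[int]     # indices of every option that is a valid fill


def fill(item: TimeDialItem, option_index: int) -> list[str]:
    """Return the dialogue with the masked span replaced by one option."""
    return [turn.replace(MASK, item.options[option_index]) for turn in item.dialogue]


def exact_match(item: TimeDialItem, predicted: list[int]) -> bool:
    """True only if the predicted indices are exactly the valid options."""
    return set(predicted) == item.correct


# Made-up example: two of the four options are temporally plausible.
item = TimeDialItem(
    dialogue=[
        "A: The meeting starts at nine, so when should we leave?",
        f"B: The drive takes about {MASK}, so eight fifteen is fine.",
    ],
    options=["half an hour", "thirty minutes", "six hours", "two days"],
    correct={0, 1},
)
```

Because more than one option can be valid, a model that picks a single best answer would fail the strict check above even when its choice is correct.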

The team tested three modelling paradigms on the TimeDial dataset:

  1. BERT classification across the four options presented.
  2. BERT-MLM mask filling for the masked span in the dialogue.
  3. T5 generative approaches.

The results show that all models struggle with this task, with the best-performing variant scoring only 73 per cent.


Qualitative error analyses reveal that pre-trained language models frequently rely on superficial, spurious features (especially text matching) rather than properly reasoning over the context. Building NLP models capable of the temporal commonsense reasoning required by TimeDial will likely require rethinking how temporal objects are represented within generic text representations.
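
To make the text-matching shortcut concrete, here is a toy sketch of a scorer that prefers whichever option shares the most words with the dialogue. It is not one of the evaluated models; the function names and the example dialogue are hypothetical, and the point is only to show how lexical overlap can favour an implausible answer.

```python
import re


def tokens(text):
    """Lowercase word and number tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(context, option):
    """Fraction of the option's tokens that also appear in the context."""
    opt = tokens(option)
    return len(opt & tokens(context)) / len(opt) if opt else 0.0


def pick_by_overlap(context, options):
    """Index of the option with the highest overlap (first one wins ties)."""
    scores = [overlap_score(context, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)


# Made-up dialogue: "nine" appears in the context, so the shortcut is
# drawn to "nine hours" even though "thirty minutes" is the sensible fill.
context = "A: The meeting starts at nine. B: The drive takes about <MASK>, so we leave at eight."
options = ["thirty minutes", "nine hours"]
choice = pick_by_overlap(context, options)
```

In this example the overlap heuristic selects the second option purely because it repeats a word from the dialogue, which is the kind of behaviour the error analysis describes.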


Disfl-QA is the first dataset to include contextual disfluencies in an information-seeking setting, with over 12k human-annotated disfluent questions. Corrections or restarts account for almost 90% of the disfluencies in Disfl-QA, making it a significantly more demanding test set for disfluency correction. Furthermore, compared to previous disfluency datasets, it contains a greater variety of semantic distractors, i.e., distractors that carry semantic meaning rather than simple speech disfluencies.

Experiments reveal that when evaluated on Disfl-QA and heuristic disfluencies in a zero-shot setting, the performance of existing state-of-the-art language model-based question answering systems falls dramatically.
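
The article does not spell out how the heuristic disfluencies are generated. As a rough sketch only, rule-based perturbations of a fluent question might look like the following; the three functions, their parameters, and the example question are assumptions for illustration, not the researchers' actual heuristics.

```python
def add_interjection(question, position, filler="um"):
    """Insert a filler word before the word at `position`."""
    words = question.split()
    words.insert(position, filler + ",")
    return " ".join(words)


def add_repetition(question, position):
    """Repeat the word at `position` once."""
    words = question.split()
    words.insert(position, words[position])
    return " ".join(words)


def add_correction(question, position, distractor, marker="no wait"):
    """Insert a wrong word plus a correction marker before the word at `position`."""
    words = question.split()
    words[position:position] = [distractor + ",", marker + ","]
    return " ".join(words)


question = "When did the war end?"  # made-up fluent question
print(add_interjection(question, 1))         # When um, did the war end?
print(add_repetition(question, 2))           # When did the the war end?
print(add_correction(question, 4, "begin"))  # When did the war begin, no wait, end?
```

The third variant is the closest to the corrections that dominate Disfl-QA: the distractor "begin" is a meaningful word, so a question answering system has to work out which reading the speaker finally intended.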


The researchers state that data augmentation strategies can partially compensate for performance loss and that employing human-annotated training data for fine-tuning is effective. They assert that for NLP models to be robust to disfluencies, researchers need large-scale disfluency datasets.

The TimeDial and Disfl-QA datasets will allow researchers to evaluate their models’ robustness to these ubiquitous phenomena across different tasks. The team hopes that future studies will devise generalised few-shot or zero-shot approaches that handle phenomena such as disfluencies and temporal reasoning effectively, without the need for task-specific human-annotated training datasets.


Github (TIMEDIAL):

Paper (Disfl-QA):

Github (Disfl-QA):

