Researchers at the University of Sheffield, Beihang University, and the Open University’s Knowledge Media Institute have introduced the task of historical text summarization, in which documents written in historical forms of a language are summarized in the corresponding modern language.
Summarizing texts is a routine and fundamentally important task for historians and digital humanities researchers, and historical text summarization can be regarded as a special case of cross-lingual summarization. However, summarizing and interpreting historical documents costs a lot of time and effort, even for experts. This is due to the limited availability of historical and modern language corpora and to cultural and linguistic variation over time.
To overcome these challenges, the researchers have proposed a transfer learning approach that automatically processes historical texts at a semantic level and generates summaries in the modern language. The team used German and Chinese to build the model because both languages have a rich textual heritage and accessible (monolingual) training resources. Additionally, these languages have distinct writing systems, alphabetic and ideographic respectively, which should facilitate future applications of the method to other languages.
The team state that their model is based on a cross-lingual transfer learning framework introduced in the paper ‘A Survey of Cross-lingual Word Embedding Models’ in 2019. The proposed transfer-learning-based approach can be bootstrapped even without cross-lingual supervision, thus tackling the issue of limited resources for training the model. As this is the first study to automate historical text summarization, there are no directly relevant methods against which to compare the proposed method.
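To make the idea of bootstrapping without cross-lingual supervision more concrete, here is a minimal sketch of one technique known from the cross-lingual word embedding literature: seeding a bilingual dictionary with words that are spelled identically in both vocabularies. This is an assumption chosen for illustration, not a confirmed detail of the researchers' method. The function name `identical_string_seed`, the `min_length` parameter, and the toy German word lists are all hypothetical.

```python
def identical_string_seed(historical_vocab, modern_vocab, min_length=2):
    """Pair up words spelled identically in both vocabularies.

    Illustrative assumption: such pairs can act as a seed dictionary for
    aligning two monolingual embedding spaces, with no parallel data needed.
    """
    shared = set(historical_vocab) & set(modern_vocab)
    # Very short strings are dropped because they tend to be noisy matches.
    return sorted((word, word) for word in shared if len(word) >= min_length)


# Toy vocabularies, invented for this example.
historical = ["vnd", "haus", "jar", "stadt", "thun"]
modern = ["und", "haus", "jahr", "stadt", "tun"]

seed_pairs = identical_string_seed(historical, modern)
print(seed_pairs)
```

Only the words whose spelling has not changed end up in the seed dictionary. In a real pipeline, pairs like these would be used to learn a mapping between embedding spaces, which in turn relates the words whose spelling has changed.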
The researchers implement two state-of-the-art baselines for standard cross-lingual summarization and conduct comprehensive evaluations, both automatic, using the standard ROUGE metric, and human. The results demonstrate that the method outperforms these standard cross-lingual baselines on the task.
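For readers unfamiliar with ROUGE, the sketch below shows the core of ROUGE-N: counting the n-grams that a generated summary shares with a reference summary and turning that overlap into precision, recall, and F1. It is a simplified illustration, not the evaluation code used in the study. The function names are hypothetical, and the whitespace tokenization is an assumption that would not suit Chinese text, which needs a proper word or character segmenter.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N between a candidate and a single reference summary."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
print(scores)
```

In this toy example five of the six unigrams in each sentence match, so precision and recall are both 5/6. Published evaluations usually rely on an established ROUGE implementation that also handles stemming and multiple references.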
The proposed model does not require parallel supervision and provides a validated high-performing baseline for future studies. Further, the team aims to improve the model to address zero-shot patterns and language change challenges. Additionally, they plan to add languages such as English and Greek and increase the dataset’s size in each language.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence in various fields. She is passionate about exploring new advancements in technology and their real-life applications.