Researchers from Datategy and Math & AI Institute Offer a Perspective for the Future of Multi-Modality of Large Language Models

Researchers from Datategy SAS in France and Math & AI Institute in Turkey propose one potential direction for the recently emerging multi-modal architectures. The central idea of their study is that the well-studied Named Entity Recognition (NER) formulation can be incorporated into a multi-modal Large Language Model (LLM) setting.

Multimodal architectures such as LLaVA, Kosmos, or AnyMAL have been gaining traction recently and have demonstrated their capabilities in practice. These models tokenize data from modalities other than text, such as images, and use external modality-specific encoders to embed them into a joint linguistic space. This lets the architectures be instruction-tuned on multi-modal data interleaved with text.
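The interleaving idea can be illustrated with a minimal, standard-library-only sketch. It is not taken from the paper or from any of the models named above: the dimensions, the random projection matrix, and the helper names `project` and `build_sequence` are all assumptions made for illustration. It only shows how features from a modality encoder might be projected into the same space as text token embeddings and placed in one sequence.

```python
import random

EMBED_DIM = 8   # width of the shared "linguistic" embedding space (assumed)
IMAGE_DIM = 4   # width of the hypothetical image-encoder features (assumed)

rng = random.Random(0)

# Toy text embedding table: one random vector per vocabulary item.
VOCAB = ["describe", "this", "image", ":"]
TEXT_EMBED = {tok: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for tok in VOCAB}

# Linear projection from image-feature space into the text embedding space.
# In a real system this would be learned; here it is random.
PROJ = [[rng.uniform(-1, 1) for _ in range(IMAGE_DIM)] for _ in range(EMBED_DIM)]


def project(features):
    """Map one modality feature vector into the joint embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in PROJ]


def build_sequence(segments):
    """Flatten interleaved ("text", token) and ("image", patches) segments
    into one ordered list of EMBED_DIM-sized vectors."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.append(TEXT_EMBED[payload])
        elif kind == "image":
            sequence.extend(project(patch) for patch in payload)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Stand-in for the output of an external image encoder: two patch vectors.
patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
segments = [("text", "describe"), ("text", "this"), ("text", "image"),
            ("text", ":"), ("image", patches)]
sequence = build_sequence(segments)
print(len(sequence), len(sequence[0]))
```

The language model would then consume `sequence` like any other list of token embeddings, which is what makes mixed text-and-image instruction tuning possible in this style of architecture.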

The authors of this paper propose that this generic architectural preference can be extended into a much more ambitious setting in the near future, which they refer to as an “omni-modal era”. “Entities”, in the sense used in NER, can be imagined as modalities for these types of architectures.

For instance, current LLMs are known to struggle with full algebraic reasoning. Although research is ongoing to develop “math-friendly” models or to use external tools, one particular direction for this problem might be to define quantitative values as a modality in this framework. Another example would be implicit and explicit date and time entities, which could be processed by a dedicated, temporally aware modality encoder.
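As a rough illustration of what treating quantitative values as a modality could look like, the sketch below pulls numbers out of a sentence, leaves a placeholder token in the text, and hands each value to a separate encoder. The `<NUM>` placeholder, the regular expression, and the two-feature encoding (sign and log-magnitude) are hypothetical choices for this example, not the authors' design.

```python
import math
import re

# Matches simple integers and decimals, optionally negative (assumed format).
NUM_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def encode_number(value):
    """Hypothetical numeric encoder: returns [sign, log10(|value| + 1)]."""
    sign = 1.0 if value >= 0 else -1.0
    magnitude = math.log10(abs(value) + 1.0)
    return [sign, magnitude]


def split_numbers(text):
    """Replace each number with a <NUM> placeholder and encode the values
    separately, in the order they appear in the text."""
    values = [float(match) for match in NUM_PATTERN.findall(text)]
    template = NUM_PATTERN.sub("<NUM>", text)
    return template, [encode_number(v) for v in values]


template, encoded = split_numbers(
    "The bridge is 1280 meters long and cost 35.5 million."
)
print(template)
print(encoded)
```

In a full system, each encoded value would be projected into the language model's embedding space at the position of its placeholder, so the model would receive the magnitude directly instead of a string of digit tokens.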

LLMs also struggle with geospatial understanding and are far from being considered “geospatially aware”. Numerical global coordinates need to be processed so that notions of proximity and adjacency are accurately reflected in the linguistic embedding space. Incorporating locations as a special geospatial modality, with a specifically designed encoder and joint training, could therefore provide a solution to this problem. Beyond these examples, the first entities that come to mind as candidates for modalities are people, institutions, etc.
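One simple way to make proximity visible in an embedding, shown here only as an assumption-laden sketch and not as the encoder the paper envisions, is to map latitude and longitude to a point on the unit sphere. The dot product of two such vectors equals the cosine of the angle between the locations, so nearer places score higher, and longitudes on either side of the antimeridian are not pushed apart. The function names and the example city coordinates are illustrative.

```python
import math


def encode_location(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def similarity(a, b):
    """Dot product of two unit vectors: cosine of the angle between them."""
    return sum(x * y for x, y in zip(a, b))


# Approximate city coordinates, used only as example inputs.
paris = encode_location(48.8566, 2.3522)
lyon = encode_location(45.7640, 4.8357)
istanbul = encode_location(41.0082, 28.9784)

# Lyon is geographically closer to Paris than Istanbul is,
# so its similarity to Paris is higher.
print(similarity(paris, lyon) > similarity(paris, istanbul))
```

A learned geospatial encoder would go well beyond this, but the sketch shows the property the paragraph asks for: distances on the globe carrying over to distances in the embedding space.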

The authors argue that this type of approach promises to address parametric/non-parametric knowledge scaling and context length limitations, as the complexity and information can be distributed across numerous modality encoders. It might also solve the problem of injecting updated information, which could be done via the modalities. The researchers only outline the boundaries of such a potential framework and discuss the promises and challenges of developing an entity-driven language model.


Check out the Paper. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
