Multi-Modal Deep Learning For Behavior Understanding And Indoor Scene Recognition

Recognizing an indoor environment is easy for humans, but training an artificial intelligence (AI) system to distinguish between different settings is not. Indoor scene recognition is a rapidly developing field with considerable potential in behavior analysis, robot localization, and elderly monitoring, to name a few applications. AI systems are typically trained to recognize spaces from images alone, and identifying a space solely from the objects in it is error-prone.

The accessibility and diversity of social media videos can provide realistic data for modern scene recognition techniques and applications. Scientists from the University of Groningen introduce InstaIndoor, a new dataset of social media videos of indoor scenes, and propose a classification model that fuses transcribed speech with visual features.

In machine learning, multimodal learning means training a model on several complementary kinds of input at once, much as people grasp an idea more readily when it is presented in several ways. This serves as the foundation for the researchers’ work, which uses two distinct modalities: video frames and speech transcribed to text.
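As a rough illustration of how two modalities can feed one classifier, the sketch below concatenates a visual feature vector with a text feature vector and applies a single dense layer followed by softmax. Concatenation is only one possible fusion strategy and is assumed here for simplicity; the function names, weights, and toy inputs are all hypothetical and are not taken from the researchers' model.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(visual_vec, text_vec, weights, biases):
    """Concatenate both modality vectors, apply one dense layer, then softmax."""
    fused = list(visual_vec) + list(text_vec)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Toy inputs (assumed): 3 visual features, 2 text features, 2 scene classes.
visual = [0.9, 0.1, 0.4]
text = [1.0, 0.0]
W = [[0.5, -0.2, 0.1, 0.8, 0.0],   # hypothetical weights for class 0
     [-0.3, 0.4, 0.2, -0.5, 0.1]]  # hypothetical weights for class 1
b = [0.0, 0.0]
probs = fuse_and_classify(visual, text, W, b)
```

In a trained network the weights would be learned from data, and each modality would usually pass through its own layers before being fused.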

Visual Modality:

Frames are sampled from the input video at fixed intervals to provide visual data. Because images preserve so much information, they are an obvious choice of modality. The shorter the time between two frames, the more similar they tend to be. To preserve crucial information while eliminating near-duplicate images, the researchers sub-sample the videos at a rate of one frame per second. The extracted frame sequences can be fed directly to ConvLSTM models. Alternatively, pretrained convolutional neural networks (CNNs) can be used to extract features from the frames.
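The one-frame-per-second sub-sampling can be sketched as an index calculation. This is a minimal sketch, assuming the video's native frame rate and total frame count are known; the function name and parameters are hypothetical, and actually decoding the frames would require a video library that is not shown here.

```python
def subsample_indices(total_frames, native_fps, target_fps=1.0):
    """Return the indices of frames to keep so roughly target_fps frames per second remain."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Number of native frames between two kept frames (never less than 1).
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while True:
        idx = int(k * step + 0.5)  # nearest frame to the ideal sampling time
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices

# A 3-second clip recorded at 30 fps keeps one frame per second.
kept = subsample_indices(total_frames=90, native_fps=30)
```

For the 90-frame clip above, the kept indices are 0, 30, and 60, so neighboring kept frames are one second apart.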

Text Modality:

Transcribed speech from the video’s audio track, in the form of conversations or explanations given by the user, can offer contextual hints. Speakers are likely to use vocabulary specific to the setting they are in. Such terms can be associated with the corresponding indoor scenes, allowing natural language processing approaches to improve classification performance.

The Google toolbox used for transcription lets users transcribe audio in a language of their choice. Once the text has been retrieved, it is preprocessed with natural language processing techniques. It is first normalized: all letters are converted to lowercase, and punctuation and stop words are removed. Stop words are extremely common words that carry little or no semantic meaning. The preprocessed text must then be transformed into a numerical format, or embedded, before being fed into the multimodal network. This process is called vectorization.
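The normalization and vectorization steps can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list and vocabulary are tiny hypothetical examples, and a simple word-count vector stands in for whichever embedding is actually used.

```python
import string
from collections import Counter

# A tiny, hypothetical stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "this"}

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]

def count_vectorize(tokens, vocabulary):
    """Map tokens to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = preprocess("This is the kitchen, and the oven is in the kitchen!")
vocab = ["kitchen", "oven", "bed"]  # hypothetical vocabulary
vector = count_vectorize(tokens, vocab)
```

Here the transcript reduces to the tokens "kitchen", "oven", "kitchen", which the count vector encodes as two occurrences of "kitchen", one of "oven", and none of "bed".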


Single-Modality Evaluation:

In a separate experiment, the researchers evaluated single-modality models for recognizing indoor scenes in InstaIndoor.

The scientists looked at the following visual features:

  • Frames: a ConvLSTM model with 512 units is used, followed by a softmax activation layer.
  • Places features: a dense, fully connected model is used.
  • ImageNet features: a dense, fully connected model is also used.

For the text features, the researchers evaluated the following:

  • Count Vectorizer: a model based on a 512-unit LSTM with a softmax activation layer is used.
  • Word2Vec Pad: a model based on a 512-unit LSTM with a softmax activation layer is used.
  • Word2Vec Sum: a dense, fully connected model is employed.
  • SentenceBERT: a dense, fully connected model is employed.
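The difference between the two Word2Vec variants above can be sketched as follows. The interpretation is an assumption based on the names and the paired models: "Pad" is taken to keep one vector per word, padded to a fixed length, which suits a sequence model such as an LSTM, while "Sum" is taken to add the word vectors into a single vector, which suits a dense model. The embeddings below are hypothetical toy values, not real Word2Vec vectors.

```python
# Hypothetical 3-dimensional embeddings; real Word2Vec vectors are learned and much larger.
EMBEDDINGS = {
    "kitchen": [0.9, 0.1, 0.0],
    "oven": [0.8, 0.2, 0.1],
    "bed": [0.0, 0.9, 0.3],
}
DIM = 3

def embed_pad(tokens, max_len):
    """Keep word order: one vector per known token, truncated or zero-padded to max_len."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS][:max_len]
    padding = [[0.0] * DIM for _ in range(max_len - len(vectors))]
    return vectors + padding

def embed_sum(tokens):
    """Discard word order: add the word vectors into a single fixed-size vector."""
    total = [0.0] * DIM
    for t in tokens:
        for i, value in enumerate(EMBEDDINGS.get(t, [0.0] * DIM)):
            total[i] += value
    return total

padded = embed_pad(["kitchen", "oven"], max_len=4)  # 4 rows of 3 values, sequence input
summed = embed_sum(["kitchen", "oven"])             # 3 values, single-vector input
```

Padding preserves the order of the words at the cost of a larger input, whereas summing yields a compact vector but loses the order in which the words were spoken.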


The single-modality baseline results show that models with visual features outperform models with text features. This is probably because the speech in many videos provides insufficient information. The best single-modality solution uses ImageNet features and achieves 61 percent accuracy.

Among the text features, Word2Vec Pad performs best, with 17 percent accuracy. All text models nonetheless tend to outperform random guessing (11 percent) by a small margin. Among the models that combine both modalities, the second-best shares most of its properties with the best one, differing only in the text feature, and reaches 69 percent accuracy. All of the top five models incorporate ImageNet Sum features, although the text processing and fusion strategies vary.

The researchers anticipate that this study’s contributions will pave the way for new research in the complex subject of indoor scene recognition.