An MIT And IBM Research Group Has Proposed A Cross-Modal Auditory Localization Framework To Locate Moving Objects Using Stereo Sound Instead of Visual Input


Object localization means locating an instance of a particular object, here a moving one, in a scene. It is usually done from visual input, such as camera frames, combined with geometric and physical modeling. However, conditions such as low light, fog, and occlusion can make a camera-based approach less effective.

To improve object localization in such unfavorable conditions, a research group from MIT and IBM has proposed a cross-modal auditory localization framework that uses stereo sound to locate objects.


In some situations, such as localizing an approaching ambulance on a busy street or a meowing cat in a dark room, auditory input can be more helpful than visual input alone. Two research areas address this: sound localization and cross-modal learning. Sound localization uses microphone arrays and beamforming to analyze the delays with which a sound reaches each microphone, and from those delays it estimates the location of the object emitting the sound. Cross-modal learning, a growing research area, exploits the complementary information in audio-visual data to transfer knowledge from one modality to another.
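
To make the delay-based idea concrete, here is a minimal sketch of classical two-microphone localization. It is not the researchers' method. It assumes a single far-field source, a nominal speed of sound of 343 m/s, and two time-aligned channels; the function names `estimate_delay_samples` and `estimate_bearing_deg` are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed nominal value for air at room temperature


def estimate_delay_samples(left, right):
    """Return how many samples `right` lags behind `left` (negative if it leads)."""
    corr = np.correlate(right, left, mode="full")
    # In "full" mode, index len(left) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(left) - 1)


def estimate_bearing_deg(left, right, sample_rate, mic_spacing):
    """Estimate the source angle in degrees from the inter-channel delay.

    Far-field assumption: path difference = mic_spacing * sin(angle).
    A positive angle means the sound reached the left microphone first.
    """
    delay_s = estimate_delay_samples(left, right) / sample_rate
    # Clip so that noise cannot push the ratio outside arcsin's domain.
    ratio = np.clip(SPEED_OF_SOUND * delay_s / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

The cross-correlation peak gives the time difference of arrival, and the microphone spacing turns that difference into a bearing. A real array would need more than two microphones, or additional cues, to resolve a full position rather than a single angle.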

The framework proposed by the MIT and IBM researchers comprises a “teacher” vision network and a “student” stereo sound network. During training, the student network learns to mimic the teacher network’s outputs, which transfers object detection knowledge from the visual modality to the audio modality.

  • The “teacher” vision network detects an object in a video and marks it with a bounding box.
  • The “student” stereo sound network then learns to map the audio signals to the bounding box coordinates predicted by the teacher.
  • At inference time, the student network predicts an object’s location directly from sound, using what it learned during training, without any visual input.
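
The three steps above can be sketched as a toy training loop. This is an illustration under strong assumptions, not the researchers' implementation: a plain linear model stands in for the student network, the teacher's bounding boxes are taken as given targets, and `train_student` and `predict_boxes` are hypothetical names.

```python
import numpy as np


def train_student(audio_feats, teacher_boxes, lr=0.1, epochs=500):
    """Fit a linear student that maps audio features to the teacher's boxes.

    audio_feats:   (n_samples, n_features) array of stereo-audio features
    teacher_boxes: (n_samples, 4) array of boxes predicted by the vision teacher
    Returns the learned weights W and bias b.
    """
    n, d = audio_feats.shape
    W = np.zeros((d, teacher_boxes.shape[1]))
    b = np.zeros(teacher_boxes.shape[1])
    for _ in range(epochs):
        pred = audio_feats @ W + b
        err = pred - teacher_boxes  # how far the student is from the teacher
        # Gradient descent step on the mean squared error.
        W -= lr * audio_feats.T @ err / n
        b -= lr * err.mean(axis=0)
    return W, b


def predict_boxes(audio_feats, W, b):
    """Inference: predict boxes from audio alone, with no visual input."""
    return audio_feats @ W + b
```

The point of the sketch is the data flow: the teacher's outputs serve as training targets, so no human-labeled boxes are needed, and only the student runs at inference time.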

The researchers ran experiments on approximately 3,000 video clips using the cross-modal auditory localization framework. The results show its potential particularly under poor lighting conditions, such as nighttime scenes, where traditional visual tracking systems do not work well. In such cases, cross-modal auditory localization could make object localization and visual tracking systems more effective.


