Simultaneous multimodal experiences are an important facilitator of perceptual learning in humans. However, building a unified model for modality fusion in artificial learning systems is difficult due to a variety of factors: (i) differences in learning dynamics between modalities, (ii) differing noise topologies, with some modality streams carrying more information for the task at hand than others, and (iii) specialized input representations.
The distinction between audio and vision input representations is particularly stark: many state-of-the-art audio classification systems rely on short-term Fourier analysis to produce log-mel spectrograms, which are frequently used as inputs to CNN architectures developed for images. These time-frequency representations differ from images in that many distinct acoustic objects can have energy at the same frequency. The translation invariances of CNNs may therefore no longer be desirable: while an acoustic object can be shifted in time, a shift in frequency can change its meaning entirely.
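To make the audio pipeline concrete, here is a minimal numpy sketch of the log-mel spectrogram computation described above: short-term Fourier analysis over windowed frames, followed by a triangular mel filterbank and log compression. The sample rate, frame size, hop, and mel count are illustrative defaults, not values specified in the work discussed here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=16000, n_fft=400, hop=160, n_mels=64):
    # Frame the signal and apply a Hann window (short-term Fourier analysis).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank: filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # Log compression yields a (time, n_mels) "image" a CNN or ViT can consume.
    return np.log(power @ fbank.T + 1e-6)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # → (98, 64)
```

The resulting 2D array is what gets treated "like an image" downstream, which is exactly why the frequency-shift caveat above matters: the vertical axis is frequency, not space.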
While distinct spatial regions of an image correspond to different objects, the visual stream in a video is three-dimensional (one temporal and two spatial dimensions), and it poses the unique difficulty of high redundancy across consecutive frames. As a result, input representations, and hence neural network architectures and benchmarks, have varied dramatically between modalities. For simplicity, the most common multimodal fusion paradigm remains an ad-hoc one: combining separate audio and visual networks via their output representations or scores, a process known as ‘late fusion.’
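Score-level late fusion is simple enough to sketch in a few lines. In this toy example the per-modality logits are made up for illustration; the point is only that the two networks never interact before their final outputs are combined.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits from independent audio and video networks
# for one clip over five classes.
audio_logits = np.array([2.0, 0.1, -1.0, 0.5, 0.0])
video_logits = np.array([0.3, 1.8, -0.5, 0.2, 0.1])

# 'Late fusion': each unimodal network runs independently, and only the
# output scores are combined, here by averaging class probabilities.
fused = 0.5 * (softmax(audio_logits) + softmax(video_logits))
print(fused.argmax())  # → 0
```

Nothing about either network's internal features is shared, which is precisely the limitation the transformer-based approaches below try to address.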
Google researchers recently published a study describing a novel transformer-based model for audio-visual fusion in video. Despite their origins as NLP models, transformers have recently gained popularity as universal perceptual models because they can model rich token correlations while making few assumptions about their inputs (and because continuous perceptual inputs can be tokenized).
Transformers have been demonstrated to perform competitively for image (ViT), video (ViViT), and more recently, audio classification (AST) by breaking dense continuous signals into patches and rasterizing them into 1D tokens. Because these models gracefully handle sequences of different lengths, a natural first extension is to feed a transformer a sequence of both visual and auditory patches, with few architectural changes.
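The patch-tokenization step common to ViT, ViViT, and AST can be sketched as follows. Shapes and patch size are illustrative (a 224×224 grayscale frame and a 16×16 patch), and the linear projection to a shared embedding width is omitted for brevity.

```python
import numpy as np

def patchify(x, p):
    """Split a 2D array into non-overlapping p x p patches,
    each flattened (rasterized) into a 1D token."""
    h, w = x.shape
    assert h % p == 0 and w % p == 0
    patches = x.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)

rng = np.random.default_rng(0)
frame = rng.standard_normal((224, 224))        # one video frame (grayscale for brevity)
spectrogram = rng.standard_normal((128, 400))  # log-mel spectrogram (mels x time)

vis_tokens = patchify(frame, 16)        # (196, 256)
aud_tokens = patchify(spectrogram, 16)  # (200, 256)

# Because transformers handle variable-length sequences, the simplest
# joint model just concatenates both token sets (after projecting each
# to a shared width, omitted here) into one sequence.
tokens = np.concatenate([vis_tokens, aud_tokens], axis=0)
print(tokens.shape)  # → (396, 256)
```

Feeding this concatenated sequence to a standard transformer is exactly the ‘early fusion’ setup described next.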
This ‘early fusion’ design allows attention to flow freely between distinct spatial and temporal regions of the image, as well as across frequency and time in the audio spectrogram. The researchers argue that full pairwise attention at all layers of the model is not required, since audio and visual inputs contain rich, fine-grained information, much of which is redundant. Moreover, because pairwise attention is quadratic in the token sequence length, such a model would not scale well to longer videos.
To counteract this, the team proposes two strategies for limiting attention flow in the model. The first follows a typical multimodal learning setup in which cross-modal flow is restricted to the network’s later layers, allowing the early layers to specialize in learning and extracting unimodal patterns. This is referred to as ‘mid fusion,’ with the fusion layer being the layer at which cross-modal interactions are introduced. ‘Early fusion’ (all layers are cross-modal) and ‘late fusion’ (all layers are unimodal) are the two extremes of this spectrum, which the researchers compare against as baselines.
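The role of the fusion-layer hyperparameter can be sketched with placeholder transformer blocks; the function below only illustrates the control flow, not the authors' actual implementation. With `fusion_layer=0` every layer is joint (early fusion); with `fusion_layer=n_layers` no layer is joint (late fusion); values in between give mid fusion.

```python
import numpy as np

def transformer_layer(x):
    # Stand-in for a real self-attention + MLP block; an identity map
    # here, so only the fusion control flow is demonstrated.
    return x + 0.0

def forward(audio, video, n_layers=12, fusion_layer=8):
    """Mid fusion: layers before fusion_layer process each modality
    separately; from fusion_layer onward, both token sets are
    concatenated and processed as one joint sequence."""
    for l in range(n_layers):
        if l < fusion_layer:
            audio = transformer_layer(audio)
            video = transformer_layer(video)
        else:
            joint = transformer_layer(np.concatenate([audio, video], axis=0))
            audio, video = joint[: len(audio)], joint[len(audio):]
    return audio, video

audio, video = forward(np.zeros((8, 4)), np.zeros((6, 4)))
print(audio.shape, video.shape)  # → (8, 4) (6, 4)
```

In the joint layers every token can still attend to every other token, so the quadratic-cost concern from above applies there; the bottleneck mechanism below is what removes it.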
The second strategy (and the key contribution) restricts cross-modal attention flow across tokens within a layer. The team does this by allowing free attention flow within a modality while forcing the model to collect and ‘condense’ information from each modality before sharing it with the other. The basic idea is to introduce a small number of latent fusion units that act as an ‘attention bottleneck,’ forcing cross-modal interactions within a layer to pass through them. The researchers show that this ‘bottlenecked’ version, called the Multimodal Bottleneck Transformer (MBT), outperforms or rivals its unrestricted equivalent at a lower computational cost.
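The bottleneck idea can be illustrated with a toy attention layer: each modality attends only over its own tokens plus a handful of shared bottleneck tokens, and the updated bottleneck tokens from both modalities are then merged (averaged here). This is a simplified single-head sketch with identity projections, not the authors' implementation, but it shows why cross-modal information must pass through the B latent units.

```python
import numpy as np

def self_attention(x):
    # Toy single-head self-attention with identity Q/K/V projections,
    # enough to show where information can flow within one layer.
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x

def bottleneck_fusion_layer(audio, video, bottleneck):
    """One bottlenecked layer: each modality attends to its own tokens
    plus B shared bottleneck tokens, so all cross-modal interaction is
    'condensed' through those B tokens."""
    B = len(bottleneck)
    a = self_attention(np.concatenate([audio, bottleneck], axis=0))
    v = self_attention(np.concatenate([video, bottleneck], axis=0))
    audio, bn_a = a[:-B], a[-B:]
    video, bn_v = v[:-B], v[-B:]
    # Merge the per-modality bottleneck updates (simple average here).
    return audio, video, 0.5 * (bn_a + bn_v)

rng = np.random.default_rng(0)
audio = rng.standard_normal((200, 16))   # audio patch tokens
video = rng.standard_normal((196, 16))   # visual patch tokens
bottleneck = rng.standard_normal((4, 16))  # B = 4 latent fusion units

audio, video, bottleneck = bottleneck_fusion_layer(audio, video, bottleneck)
print(audio.shape, video.shape, bottleneck.shape)
```

The cost win is visible in the shapes: attention is quadratic in (200 + 4) and (196 + 4) separately, rather than in the full joint length (200 + 196) that early fusion would require.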
The team experiments on three video classification datasets: AudioSet, Epic-Kitchens-100, and VGGSound. The backbone architecture is identical to that of ViT; specifically, the researchers employ ViT-Base initialized from ImageNet-21K, but they emphasize that the method is backbone-agnostic and can be applied to any other transformer backbone.
Multimodal fusion beats the higher-performing single-modality baseline across all datasets, illustrating the value of complementary information. The team points out that the modalities’ relative value for classification varies by dataset (audio-only performs relatively better on AudioSet and worse on Epic-Kitchens, while the audio and visual baselines are equally strong on VGGSound). This is partly due to the dataset annotation methodology, and it distinguishes VGGSound as a particularly fusion-ready dataset.
The researchers also discovered that audio-visual fusion improves performance on datasets that were previously purely video-based, such as Kinetics and Moments in Time. Examining per-class performance on AudioSet, they found that for almost all (57 out of 60) of the top classes (ordered by overall performance), audio-visual fusion outperforms the audio-only baseline.
Google researchers propose a new transformer architecture (MBT) for audio-visual fusion and investigate a variety of fusion strategies based on cross-attention through latent tokens. They offer a novel technique for limiting cross-modal attention via a small number of fusion ‘bottlenecks’ and show that it outperforms vanilla cross-attention at a lower computational cost, obtaining state-of-the-art results on several benchmarks. In future work, MBT will be extended to other modalities, such as text and optical flow.