Researchers from Microsoft Research Asia and the Institute of Automation, CAS have presented a module built on an attention-based decoder that efficiently integrates different object representations used in computer vision (CV).
Object detection is a fundamental problem in computer vision, and many visual applications depend on it. Although there are various approaches to the problem, each typically relies on a single visual representation format. For example, most object detection frameworks use a rectangular box to represent object hypotheses in all intermediate stages. Some frameworks instead use points to represent an object hypothesis, e.g., the center point in CenterNet and FCOS. CornerNet uses a part representation, composing an object from its corner points rather than representing the whole object. Each representation steers a detector toward performing well in particular aspects, so each has different strengths. For example, the center representation avoids the need for an anchor design and is usually friendly to small objects, while the corner representation tends to give more accurate localization.
CV object detection frameworks have improved immensely, but each delivers its strongest performance on one aspect of an object's structure. Integrating them into one framework is challenging because the features extracted for distinct representations are heterogeneous. The researchers have therefore proposed BVR (Bridging Visual Representations), a generic module that is convenient to use. The module combines different visual representations in a single framework so that the strengths of each can be used.
To model the dependencies between heterogeneous features efficiently, the researchers use an attention-based decoder module, similar to the one in the Transformer. The primary representation of an object detector serves as the query input. The other visual representations serve as keys that enhance the query features, taking into account both appearance and geometric relationships.
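The query/key interaction described above can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation: the function name `bridge_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the linear geometric term `w_geo` are all hypothetical, and the geometric term here is a simple stand-in (a linear projection of relative positions) chosen only to show how appearance and geometry can both contribute to the attention weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_attention(query_feat, query_pos, key_feat, key_pos,
                     w_q, w_k, w_v, w_geo):
    """Enhance query features with key features (hypothetical sketch).

    query_feat: (Nq, C) features of the detector's primary representation
    query_pos:  (Nq, 2) query locations, e.g. box centers
    key_feat:   (Nk, C) features of another representation, e.g. corners
    key_pos:    (Nk, 2) key locations
    w_q, w_k:   (C, d) projections for the appearance term (assumed)
    w_v:        (C, C) value projection (assumed)
    w_geo:      (2,)   weights of a toy linear geometric term (assumed)
    """
    d = w_q.shape[1]
    q = query_feat @ w_q                       # (Nq, d)
    k = key_feat @ w_k                         # (Nk, d)
    v = key_feat @ w_v                         # (Nk, C)

    # Appearance term: scaled dot-product similarity.
    appearance = q @ k.T / np.sqrt(d)          # (Nq, Nk)

    # Geometric term: score computed from query-key relative positions.
    rel = query_pos[:, None, :] - key_pos[None, :, :]   # (Nq, Nk, 2)
    geometry = rel @ w_geo                     # (Nq, Nk)

    weights = softmax(appearance + geometry, axis=-1)   # (Nq, Nk)
    # Residual connection: the query keeps its own feature and adds
    # the attention-weighted key information.
    return query_feat + weights @ v, weights
```

In this sketch the output has the same shape as the query features, so the enhanced features could be passed to the detector's existing heads unchanged.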
To address computation and memory consumption, the researchers have incorporated two novel techniques into the BVR module: a key sampling approach and a shared location embedding approach.
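The article does not say how keys are sampled. One plausible form, assumed here purely for illustration, is to keep only the highest-scoring key points so that each query attends to a small, fixed number of keys instead of every location. The function name `sample_top_keys` and the per-key confidence `key_score` are hypothetical, and the sketch covers key sampling only.

```python
import numpy as np

def sample_top_keys(key_feat, key_pos, key_score, k):
    """Keep the k keys with the highest score (hypothetical sketch).

    key_feat:  (Nk, C) key features
    key_pos:   (Nk, 2) key locations
    key_score: (Nk,)   per-key confidence, e.g. a keypoint score (assumed)
    k:         number of keys to keep, k >= 1
    """
    k = min(k, key_score.shape[0])
    # argpartition finds the top-k indices without a full sort.
    idx = np.argpartition(-key_score, k - 1)[:k]
    # Order the kept keys from highest to lowest score.
    idx = idx[np.argsort(-key_score[idx])]
    return key_feat[idx], key_pos[idx], idx
```

With sampling of this kind, the attention matrix shrinks from one column per candidate key to k columns, which is where the memory saving would come from.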
Experiments show that BVR is useful with four prevalent object detectors: RetinaNet, Faster R-CNN, FCOS, and ATSS. When the researchers applied BVR to each detector, they observed improvements of about 1.5 to 3.0 in average precision (AP).
This work aims to help researchers develop more reliable object detection algorithms. Since object detection is an indispensable part of object-oriented visual applications, the module can help address related concerns. The team notes that, like other detectors, this method can fail unpredictably, and they urge against using it in scenarios where failures could have severe consequences. How serious a failure is depends on the downstream application. The team also advises caution about data collection: because the method is data-driven, its performance may be affected by biases in the data.