Researchers at Google AI recently developed IconNet, a vision-based model that helps Android users control their mobile devices hands-free through Voice Access. One key challenge for voice access technology is identifying the user interface (UI) elements on a mobile screen. Voice Access relies on accessibility labels to do this, but many apps do not label elements such as icons adequately, and mobile devices have only limited compute resources, which restricts how much voice control is possible. To overcome this roadblock, an efficient system is needed that detects icons automatically from the pixel values on screen instead of from the accessibility labels.
Current Challenges With UI Identification and Voice Access
Developing an on-device UI element detector is difficult because the detector has to run on a wide range of phones with different performance capabilities, and it has to do so without endangering user privacy. This calls for a lightweight model with low inference latency. Inference latency is the delay between the moment an input (here, a screenshot) is passed to the model and the moment the model returns its predictions, and it is usually measured in milliseconds. Voice Access needs the detected labels to be available when the user speaks a command; therefore, the lower the inference time, the better the experience.
IconNet is a detection system based on the CenterNet object-detection architecture that automatically detects icons on the screen regardless of the underlying structure of the app. It has been launched as part of the latest version of Voice Access. Currently, IconNet can detect 31 different icon types, and the number is expected to rise to 70 in the near future. Running at 9 FPS on a Pixel 3A, IconNet reaches a mean average precision (mAP) of about 94.2%, with the aim of giving users a consistent experience.
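To make the link between frame rate and inference latency concrete, here is a minimal Python sketch. It is not IconNet's benchmarking code: `measure_latency_ms`, `fps_to_latency_ms`, and the stand-in `infer` callable are hypothetical names introduced for illustration, and a real measurement would call the on-device model instead.

```python
import statistics
import time


def fps_to_latency_ms(fps):
    """Convert a frame rate into the average time budget per frame, in ms."""
    return 1000.0 / fps


def measure_latency_ms(infer, sample, warmup=3, runs=20):
    """Return the median wall-clock time of infer(sample) in milliseconds.

    `infer` is any callable standing in for a model (an assumption here).
    """
    for _ in range(warmup):
        infer(sample)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# A 9 FPS detector has roughly 111 ms to process each frame.
budget_ms = fps_to_latency_ms(9)

# Stand-in "model": summing a list, just to exercise the timer.
latency_ms = measure_latency_ms(sum, list(range(1000)))
```

The median is used rather than the mean so that a single slow run (for example, one interrupted by the scheduler) does not distort the figure.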
How IconNet Works
IconNet takes the screen image as input, extracts features from it, and then predicts the center and the size of the bounding box around each icon. UI elements are typically simple, symmetric geometric shapes, so the CenterNet architecture is well suited to the task: finding the center of such an icon is comparatively easy. The loss combines an L1 loss for the icon sizes with the CornerNet Focal loss for the center predictions, and the focal loss also addresses the imbalance between icon classes. For the backbone, Hourglass was chosen as the most promising server-side architecture and served as the starting point for a design tailored to icon and UI element detection.
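The center-plus-size idea can be sketched in a few lines of Python with NumPy. This is a generic CenterNet-style decoding step, not IconNet's actual code: the function name `decode_centers`, the array layouts, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np


def decode_centers(heatmap, sizes, threshold=0.5):
    """Turn center heatmaps and size maps into bounding boxes.

    heatmap: (H, W, C) array of per-class center scores in [0, 1].
    sizes:   (H, W, 2) array of predicted (width, height) at each pixel.
    Returns a list of (class_id, score, x_min, y_min, x_max, y_max).
    """
    h, w, _ = heatmap.shape
    # Pad with -inf so border pixels can be compared with a full 3x3 window.
    padded = np.pad(heatmap, ((1, 1), (1, 1), (0, 0)),
                    constant_values=-np.inf)
    # Maximum over each pixel's 3x3 spatial neighborhood, per class.
    neighborhood = np.stack([padded[dy:dy + h, dx:dx + w]
                             for dy in range(3) for dx in range(3)])
    local_max = neighborhood.max(axis=0)
    # A center is a pixel that is a local maximum and scores high enough.
    is_peak = (heatmap >= local_max) & (heatmap >= threshold)

    boxes = []
    for y, x, k in zip(*np.nonzero(is_peak)):
        bw, bh = sizes[y, x]
        boxes.append((int(k), float(heatmap[y, x, k]),
                      float(x - bw / 2), float(y - bh / 2),
                      float(x + bw / 2), float(y + bh / 2)))
    return boxes


# Toy example: one strong center at (x=4, y=3) and a weaker neighbor.
heatmap = np.zeros((8, 8, 2), dtype=np.float32)
heatmap[3, 4, 1] = 0.9
heatmap[3, 5, 1] = 0.6  # suppressed: not a local maximum
sizes = np.zeros((8, 8, 2), dtype=np.float32)
sizes[3, 4] = (2.0, 4.0)
boxes = decode_centers(heatmap, sizes)
```

Keeping only local maxima plays the role that non-maximum suppression plays in box-based detectors, which is part of what makes center-based detection cheap at inference time.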
Once the initial architecture had been selected, the focus shifted to neural architecture search (NAS) to explore variations that balance model performance (mAP) against latency (measured in FLOPs). In addition, Fine-Grained Stochastic Architecture Search (FiGS) was used to refine the backbone design. FiGS uncovered sparse structures by pruning unnecessary connections, which reduced the model size by 20% without any impact on performance.
To improve inference time, the model was further optimized to run through the Neural Networks API (NNAPI) on Qualcomm DSPs by converting it to use 8-bit integer quantization. The quantized model runs six times faster than the original and is 50% smaller, at the cost of only a minor 0.5% loss in mAP.
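As a rough illustration of what 8-bit integer quantization means, the sketch below applies a generic affine (scale and zero-point) scheme to a float array with NumPy. It is not the NNAPI conversion pipeline used for IconNet; the function names `quantize_int8` and `dequantize` and the toy weights are assumptions for illustration only.

```python
import numpy as np


def quantize_int8(x):
    """Affine quantization of a float array to int8.

    Returns (q, scale, zero_point) such that x ~= (q - zero_point) * scale.
    """
    lo, hi = float(x.min()), float(x.max())
    # Extend the range so that 0.0 is always exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    # Fall back to a scale of 1.0 if the array is all zeros.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale


# Toy "weights": 101 evenly spaced values in [-1, 1].
weights = np.linspace(-1.0, 1.0, 101).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.max(np.abs(restored - weights)))
```

Each value is stored in one byte instead of four, and the rounding error per value stays within about one quantization step, which is why accuracy typically drops only slightly.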
To assess how IconNet compares with other available detectors, three metrics were used. The first is the traditional mAP. The second is false-positive detection, in which any incorrect detection of an icon is penalized. The third is center in region of interest (CIROI), a metric designed to return a positive match whenever the center of the detected bounding box lies inside the ground-truth bounding box. The outcome favored IconNet, which outperformed the other mobile-compatible icon detectors.
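The CIROI check described above is simple enough to sketch directly. The following Python function is an illustrative reading of that definition, not the evaluation code used for IconNet; the name `center_in_roi` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
def center_in_roi(detected, ground_truth):
    """Return True if the center of `detected` lies inside `ground_truth`.

    Both boxes are (x_min, y_min, x_max, y_max) tuples in pixels.
    """
    center_x = (detected[0] + detected[2]) / 2
    center_y = (detected[1] + detected[3]) / 2
    x_min, y_min, x_max, y_max = ground_truth
    return x_min <= center_x <= x_max and y_min <= center_y <= y_max


# The detection is offset, but its center (20, 20) is inside the truth box.
hit = center_in_roi((10, 10, 30, 30), (15, 15, 40, 40))

# This detection's center (5, 5) falls outside the truth box.
miss = center_in_roi((0, 0, 10, 10), (15, 15, 40, 40))
```

Unlike an overlap-based match, this test accepts a detection whose box is offset or mis-sized as long as its center still lands on the icon.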
IconNet is a promising step for on-device icon detection. While the current model shows encouraging results, work continues on improving the technology and increasing the range and diversity of the elements it can detect. A further planned feature will help differentiate between icons that look similar by identifying their functionality.