Microsoft AI Research Introduces UFO: An Innovative UI-Focused Agent to Fulfill User Requests Tailored to Applications on Windows OS, Harnessing the Capabilities of GPT-Vision

Microsoft has recently released UFO, a UI-focused agent specialized for Windows OS interaction. UFO addresses the challenge of operating the graphical user interface (GUI) of Windows applications through natural language commands. Large language models (LLMs) have shown successful results in understanding and executing textual commands, but they are still unable to navigate and operate within the UI of Windows applications.

Existing UI agents focus mainly on smartphones or web applications, and agents tailored specifically to the Windows OS environment have been lacking. To fill this gap, Microsoft’s researchers proposed UFO, a UI-focused agent designed for smooth interaction with Windows applications. UFO uses a dual-agent framework comprising an Application Selection Agent (AppAgent) and an Action Selection Agent (ActAgent). Both agents use GPT-Vision to analyze GUI screenshots and control information, which lets them choose the right application and carry out the required actions. UFO also incorporates features such as control interaction, application switching, action customization, and safeguards to enhance its functionality and user experience.

UFO works by first analyzing the user’s request and the current desktop environment, which includes screenshots and the list of available applications. Based on this analysis, the AppAgent selects an appropriate application and develops a global task completion strategy. The ActAgent then works within the selected application, iteratively selecting a control and performing an action on it until the user request is fulfilled. UFO’s control interaction module translates the selected actions into executable operations, allowing automated execution without human intervention.
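The workflow described above can be sketched roughly as follows. This is a minimal illustration, not UFO's actual code: the names `Desktop`, `Action`, `app_agent`, `act_agent`, `execute`, and `run` are all hypothetical, the keyword rule stands in for the GPT-Vision calls, and the "desktop" is a plain dictionary rather than a real Windows session. A real ActAgent would also re-inspect a fresh screenshot at every step instead of walking a fixed plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    control: str         # label of the UI control to act on
    operation: str       # e.g. "click" or "set_text"
    argument: str = ""   # optional text payload


@dataclass
class Desktop:
    # Toy stand-in for the desktop: application name -> labels of its controls.
    apps: dict
    executed: list = field(default_factory=list)


def app_agent(request: str, desktop: Desktop):
    """Hypothetical AppAgent: choose an application and draft a global plan.

    A real agent would pass the screenshot and request to a vision model;
    this stub uses a keyword rule so the example runs offline.
    """
    app = "Outlook" if "email" in request.lower() else "Notepad"
    if app not in desktop.apps:
        raise LookupError(f"{app} is not available on this desktop")
    if app == "Outlook":
        plan = [Action("New Email", "click"),
                Action("Body", "set_text", request),
                Action("Send", "click")]
    else:
        plan = [Action("Editor", "set_text", request)]
    return app, plan


def act_agent(plan, step: int, controls) -> Optional[Action]:
    """Hypothetical ActAgent: return the next action, or None when finished."""
    if step >= len(plan):
        return None
    action = plan[step]
    if action.control not in controls:
        raise LookupError(f"control {action.control!r} not found")
    return action


def execute(desktop: Desktop, app: str, action: Action) -> None:
    """Stand-in for the control interaction module: record instead of clicking."""
    desktop.executed.append(f"{app}:{action.operation}:{action.control}")


def run(request: str, desktop: Desktop, max_steps: int = 10):
    """Select an application once, then act step by step until done."""
    app, plan = app_agent(request, desktop)
    for step in range(max_steps):
        action = act_agent(plan, step, desktop.apps[app])
        if action is None:
            return app, True      # request fulfilled
        execute(desktop, app, action)
    return app, False             # step budget exhausted
```

Calling `run("Send an email to Bob", desktop)` on a `Desktop` whose `apps` dictionary contains an "Outlook" entry with the controls "New Email", "Body", and "Send" selects Outlook once and then records three operations in `desktop.executed`. The `max_steps` cap is an assumption added here so the loop always terminates.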

The framework is highly extensible: users can create custom actions and controls for specific tasks and applications. The researchers evaluated UFO on a wide range of user requests, and it succeeded on almost every task across the Windows applications tested, highlighting its versatility and potential to increase user productivity.
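One common way to support user-defined actions of this kind is a small registry that maps action names to handler functions. The sketch below illustrates that general pattern only; `ACTIONS`, `register_action`, `dispatch`, and the `highlight_cell` action are hypothetical names chosen for this example and are not taken from the UFO codebase.

```python
from typing import Callable, Dict

# Registry mapping an action name to the function that performs it.
ACTIONS: Dict[str, Callable[..., str]] = {}


def register_action(name: str):
    """Decorator that adds a handler to the registry under `name`."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        if name in ACTIONS:
            raise ValueError(f"action {name!r} is already registered")
        ACTIONS[name] = func
        return func
    return decorator


@register_action("click")
def click(control: str) -> str:
    # Built-in style action; a real handler would drive the UI here.
    return f"clicked {control}"


@register_action("highlight_cell")
def highlight_cell(control: str, color: str = "yellow") -> str:
    # Example of a custom, application-specific action added by a user.
    return f"highlighted {control} in {color}"


def dispatch(name: str, *args, **kwargs) -> str:
    """Look up an action by name and run it with the given arguments."""
    try:
        handler = ACTIONS[name]
    except KeyError:
        raise ValueError(f"unknown action: {name}") from None
    return handler(*args, **kwargs)
```

With this layout, adding a new action means writing one decorated function; the dispatch code does not change. Rejecting unknown or duplicate names keeps the agent from silently running the wrong handler.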

In conclusion, UFO interacts efficiently with Windows applications through natural language commands. By leveraging GPT-Vision and a dual-agent framework, it navigates and operates within Windows applications effectively to fulfill user requests.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

