Meet HuggingGPT: A Framework That Leverages LLMs to Connect Various AI Models in Machine Learning Communities (Hugging Face) to Solve AI Tasks

Large language models (LLMs) such as ChatGPT have attracted great interest from researchers and businesses alike because of their impressive results on a wide range of NLP tasks. Through extensive pre-training on enormous text corpora and reinforcement learning from human feedback (RLHF), LLMs acquire strong language understanding, generation, interaction, and reasoning capabilities. This potential has opened many new areas of study and broad opportunities to build advanced AI systems.

To reach their full potential and take on challenging AI tasks, LLMs must collaborate with other models. Choosing the right middleware to establish communication channels between LLMs and AI models is therefore critical. The researchers observe that each AI model can be represented in language by summarizing its function, and they propose that “LLMs use language as a generic interface to link together various AI models.” Specifically, once model descriptions are included in prompts, an LLM can act as the central nervous system that manages AI models: planning, scheduling, and coordinating them. With this approach, an LLM can call on third-party models to complete AI tasks. A further difficulty arises, however: covering many AI tasks requires collecting many high-quality model descriptions, which demands intensive prompt engineering. Many public ML communities already offer a wide selection of models suited to specific AI tasks in language, vision, and speech, and these models come with clear and concise descriptions.

The research team proposes HuggingGPT, a system that connects an LLM (ChatGPT) with an ML community (Hugging Face), processes inputs from several modalities, and solves numerous complex AI problems. To make the models usable by ChatGPT, the researchers combine the description of each AI model in the Hugging Face library with the prompt. The LLM (ChatGPT) then serves as the system’s “brain” and answers users’ queries.

The Hugging Face Hub lets researchers and developers collaborate on machine learning models and datasets. It also offers a straightforward interface for finding and downloading ready-to-use models for a variety of applications.

HuggingGPT phases

HuggingGPT can be broken down into four distinct steps:

  • Task Planning: ChatGPT interprets the user’s request and, guided by prompts, breaks it down into discrete, actionable tasks.
  • Model Selection: Based on the model descriptions, ChatGPT selects expert models hosted on Hugging Face to carry out the planned tasks.
  • Task Execution: Each selected model is invoked and run, and its results are returned to ChatGPT.
  • Response Generation: ChatGPT integrates the predictions of all models and generates an answer for the user.
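
The control flow of the four steps above can be sketched as a small orchestrator. This is only an illustration, not the project’s actual code: the function `run_pipeline`, the record layout, and the stand-in stages are all assumptions made here so that the sketch runs without an LLM or network access.

```python
from typing import Any, Callable

Task = dict[str, Any]


def run_pipeline(
    request: str,
    plan: Callable[[str], list[Task]],
    select: Callable[[Task], str],
    execute: Callable[[str, Task], Any],
    respond: Callable[[str, list[dict[str, Any]]], str],
) -> str:
    """Chain the four stages; each stage is passed in as a plain callable."""
    records = []
    for task in plan(request):              # 1. task planning
        model_id = select(task)             # 2. model selection
        result = execute(model_id, task)    # 3. task execution
        records.append({"task": task, "model": model_id, "result": result})
    return respond(request, records)        # 4. response generation


# Stand-in stages (hypothetical): a real system would call an LLM and model endpoints.
answer = run_pipeline(
    "Describe this picture",
    plan=lambda req: [{"task": "image-to-text", "args": {"image": "example.jpg"}}],
    select=lambda task: f"hypothetical/{task['task']}-model",
    execute=lambda model_id, task: "a cat on a sofa",
    respond=lambda req, recs: f"{req}: {recs[0]['result']} (via {recs[0]['model']})",
)
print(answer)
```

Passing the stages in as callables keeps the sketch independent of any particular LLM client or inference API.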

A closer look at each phase:

HuggingGPT begins with a large language model breaking down a user request into discrete tasks. For complex requests, the large language model must also establish the relationships and order among tasks. To guide the model toward effective task planning, HuggingGPT’s prompt design combines specification-based instruction with demonstration-based parsing.
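
To make the planning step more concrete, here is a minimal sketch of a planning prompt that pairs a format specification with a demonstration, plus a parser that orders the returned tasks by their dependencies. The field names (`id`, `task`, `dep`, `args`), the prompt wording, and the example reply are assumptions for illustration; they are not taken from the HuggingGPT source.

```python
import json

# Specification-based instruction: states the output format the LLM must follow.
# The field names are assumptions made for this sketch.
SPEC = (
    "Reply with a JSON list of tasks. Each task has the form "
    '{"id": int, "task": str, "dep": [ids of prerequisite tasks], "args": {...}}.'
)

# Demonstration-based parsing: worked examples placed in the prompt.
DEMOS = [
    (
        "Caption example.jpg",
        '[{"id": 0, "task": "image-to-text", "dep": [], "args": {"image": "example.jpg"}}]',
    ),
]


def build_planning_prompt(request: str) -> str:
    demo_text = "\n".join(f"User: {q}\nPlan: {a}" for q, a in DEMOS)
    return f"{SPEC}\n\n{demo_text}\n\nUser: {request}\nPlan:"


def parse_plan(llm_output: str) -> list[dict]:
    """Parse a JSON task list and return it in an order that respects dependencies."""
    tasks = {t["id"]: t for t in json.loads(llm_output)}
    ordered, done = [], set()
    while len(ordered) < len(tasks):
        ready = [
            t for t in tasks.values()
            if t["id"] not in done and all(d in done for d in t["dep"])
        ]
        if not ready:
            raise ValueError("circular or unknown dependency in plan")
        for t in ready:
            ordered.append(t)
            done.add(t["id"])
    return ordered


prompt = build_planning_prompt("Caption example.jpg and read the caption aloud")

# A hypothetical LLM reply, deliberately listed out of order.
example_reply = """[
  {"id": 1, "task": "text-to-speech", "dep": [0], "args": {"text": "<result of task 0>"}},
  {"id": 0, "task": "image-to-text", "dep": [], "args": {"image": "example.jpg"}}
]"""
plan = parse_plan(example_reply)
print([t["task"] for t in plan])
```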

After parsing the task list, HuggingGPT must select an appropriate model for each task. The researchers retrieve expert model descriptions from the Hugging Face Hub and then use an in-context task-model assignment mechanism to choose dynamically which model to apply to each task. This approach is more open and flexible: adding an expert model only requires providing its description, so models can be incorporated incrementally.
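
One way to picture in-context task-model assignment is to shortlist candidate models for a task and place their descriptions in a prompt for the LLM to choose from. The sketch below assumes a simple filter by task type and a ranking by download count; the `ModelCard` type, the model ids, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelCard:
    model_id: str
    task: str
    downloads: int
    description: str


def shortlist(cards: list[ModelCard], task: str, k: int = 2) -> list[ModelCard]:
    """Keep models that match the task type, most-downloaded first (an assumed heuristic)."""
    matching = [c for c in cards if c.task == task]
    return sorted(matching, key=lambda c: c.downloads, reverse=True)[:k]


def build_selection_prompt(task: str, candidates: list[ModelCard]) -> str:
    """Put the candidate descriptions in context so the LLM can pick one."""
    lines = [f"- {c.model_id}: {c.description}" for c in candidates]
    return (
        f"Task: {task}\nCandidate models:\n"
        + "\n".join(lines)
        + "\nReply with the id of the most suitable model."
    )


# Hypothetical catalog entries; real descriptions would come from the Hub.
catalog = [
    ModelCard("example-org/detector-small", "object-detection", 1200, "Fast detector for common objects"),
    ModelCard("example-org/detector-large", "object-detection", 5400, "High-accuracy detector"),
    ModelCard("example-org/detector-old", "object-detection", 300, "Legacy detector"),
    ModelCard("example-org/captioner", "image-to-text", 9000, "Generates image captions"),
]

candidates = shortlist(catalog, "object-detection", k=2)
selection_prompt = build_selection_prompt("object-detection", candidates)
print(selection_prompt)
```

Because only descriptions enter the prompt, supporting another model amounts to adding one more catalog entry.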

Once a model has been assigned to a task, the next step is to run it, a process known as model inference. HuggingGPT uses hybrid inference endpoints to speed up these models and keep their computation stable. The models receive the task arguments as inputs, perform the computation, and return the inference results to the large language model. Tasks with no resource dependencies on one another can be run in parallel to improve inference efficiency further: every task whose dependencies have been met can be started at the same time.
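
Dependency-aware parallel execution can be sketched with a thread pool that launches, in each round, every task whose prerequisites have finished. This is an assumed scheduling scheme for illustration; `run_tasks`, the task fields, and the `fake_infer` stand-in are not from the HuggingGPT code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, run_one):
    """Run tasks in waves: all tasks whose dependencies are finished start in parallel."""
    results = {}
    pending = {t["id"]: t for t in tasks}
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [
                t for t in pending.values()
                if all(d in results for d in t["dep"])
            ]
            if not ready:
                raise ValueError("unsatisfiable dependencies")
            # Each task receives the results of the tasks it depends on.
            futures = {
                t["id"]: pool.submit(run_one, t, {d: results[d] for d in t["dep"]})
                for t in ready
            }
            for task_id, future in futures.items():
                results[task_id] = future.result()
                del pending[task_id]
    return results


def fake_infer(task, inputs):
    # Stand-in for a real model call (hypothetical).
    upstream = ", ".join(inputs[d] for d in sorted(inputs))
    return f"{task['task']}({upstream})"


tasks = [
    {"id": 0, "task": "object-detection", "dep": []},
    {"id": 1, "task": "image-to-text", "dep": []},
    {"id": 2, "task": "text-to-speech", "dep": [1]},
]
results = run_tasks(tasks, fake_infer)
print(results[2])
```

Here tasks 0 and 1 run in the same wave, while task 2 waits for the result of task 1.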

HuggingGPT moves into the response generation step once all tasks have been executed. It compiles the outputs of the previous three steps (task planning, model selection, and task execution) into a single, cohesive report. This report lists the tasks that were planned, the models that were chosen for those tasks, and the inference results those models returned.
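
As a rough illustration of that report, the planned tasks, chosen models, and inference results can be serialized into one structured block and handed back to the LLM. The record layout, the model id, and the prompt wording below are assumptions for this sketch.

```python
import json


def build_response_prompt(request: str, records: list[dict]) -> str:
    """Serialize task, model, and result records into a prompt for the final answer."""
    report = json.dumps(records, indent=2, ensure_ascii=False)
    return (
        f"User request: {request}\n"
        f"Execution report (planned tasks, chosen models, inference results):\n{report}\n"
        "Answer the user's request using only the results above."
    )


# Hypothetical records from the earlier stages.
records = [
    {
        "task": {"id": 0, "task": "image-to-text", "args": {"image": "example.jpg"}},
        "model": "example-org/captioner",
        "result": "a cat on a sofa",
    },
]
response_prompt = build_response_prompt("Describe this picture", records)
print(response_prompt)
```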


HuggingGPT contributions

  • HuggingGPT offers inter-model cooperation protocols that combine the strengths of large language models and expert models. Separating the large language model, which acts as the brain for planning and decision-making, from the smaller models, which act as executors for each task, opens new approaches to building general AI systems.
  • By integrating more than 400 task-specific models from the Hugging Face Hub around ChatGPT, the researchers built HuggingGPT to take on broad classes of AI problems. Thanks to this open collaboration among models, HuggingGPT can offer users dependable multimodal chat services.
  • Extensive experiments on challenging AI tasks in language, vision, speech, and cross-modality show that HuggingGPT can understand and solve complicated tasks across multiple modalities and domains.


HuggingGPT advantages

  • Because its design allows it to use external models, HuggingGPT can perform a variety of complex AI tasks and integrate multimodal perceptual skills.
  • This pipeline also lets HuggingGPT keep absorbing knowledge from domain-specific experts, which makes its AI capabilities extensible and scalable.
  • HuggingGPT has incorporated hundreds of Hugging Face models around ChatGPT, spanning 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. The experimental results show that HuggingGPT can handle complex AI tasks and multimodal data.


HuggingGPT limitations

  • HuggingGPT still has limitations. The first is efficiency, which is a potential barrier to practical use.
  • The main efficiency bottleneck is the inference of the large language model. For each round of user requests, HuggingGPT must interact with the large language model multiple times: during task planning, model selection, and response generation. These exchanges significantly lengthen response times and lower the quality of service for end users.
  • The second limitation is the maximum context length. Because the LLM accepts only a limited number of tokens, HuggingGPT is bound by the same limit. To mitigate this, the researchers use a conversation window and track the conversation context only during the task-planning stage.
  • The third concern is the stability of the system as a whole. During inference, large language models occasionally fail to follow instructions, and the output format may not be what the program expects.
  • In addition, the expert models hosted on Hugging Face’s inference endpoints are not fully controllable: they may fail during the task execution stage because of network latency or service status.
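
The stability issues in the list above (malformed LLM output and flaky endpoints) are commonly handled by validating each reply and retrying on failure. The sketch below shows that general pattern under stated assumptions; `call_with_retry`, `validate_plan`, and the simulated replies are hypothetical and do not describe how HuggingGPT itself handles these failures.

```python
import json
import time


def validate_plan(text: str) -> list:
    """Reject replies that are not a JSON list of objects with a 'task' field."""
    plan = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(plan, list) or not all(
        isinstance(t, dict) and "task" in t for t in plan
    ):
        raise ValueError("plan must be a JSON list of objects with a 'task' field")
    return plan


def call_with_retry(fn, validate, attempts: int = 3, delay: float = 0.0):
    """Call fn, validate its output, and retry on format or connection errors."""
    last_error = None
    for _ in range(attempts):
        try:
            return validate(fn())
        except (ValueError, ConnectionError, TimeoutError) as err:
            last_error = err
            time.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Simulated LLM: the first reply ignores the format, the second one follows it.
replies = iter(["Sure! Here is the plan:", '[{"task": "image-to-text"}]'])
plan = call_with_retry(lambda: next(replies), validate_plan)
print(plan)
```

Retries trade extra latency for robustness, so in practice the attempt count would need to be weighed against the efficiency concern noted in the same list.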

The source code is available in a GitHub repository named “JARVIS”.

In conclusion

Advancing AI requires solving challenging problems across a variety of domains and modalities. Many AI models exist, but none is powerful enough on its own to handle complex AI tasks. An LLM could serve as a controller that manages existing AI models to perform such tasks, with language as the generic interface, because LLMs have demonstrated outstanding language understanding, generation, interaction, and reasoning abilities. Following this idea, the researchers present HuggingGPT, a framework that uses an LLM (such as ChatGPT) to connect AI models from machine learning communities (such as Hugging Face) to complete AI tasks. More specifically, it uses ChatGPT to plan tasks after receiving a user request, select models based on the descriptions of their functions on Hugging Face, run each subtask with the selected model, and compile a response from the results. By combining ChatGPT’s strong language capabilities with Hugging Face’s large collection of AI models, HuggingGPT can perform a wide range of complex AI tasks across modalities and domains, with strong results in language, vision, speech, and more.
