This AI Research Proposes PerSAM: A Training-Free Personalization Approach For The Segment Anything Model (SAM)

Thanks to the extensive availability of pre-training data and computing resources, foundation models in vision, language, and multi-modality have become increasingly common. They support varied interactions, including human feedback, and show exceptional generalization power in zero-shot settings. Drawing inspiration from the success of large language models, Segment Anything builds a careful data engine to gather 11M images with over a billion masks, then trains a powerful segmentation foundation model known as SAM. It begins by defining a brand-new promptable segmentation paradigm: the model takes a constructed prompt as input and outputs the predicted mask. Given a suitable prompt, which can be points, boxes, masks, or free-form text, SAM can segment any object in a visual scene.

Figure 1: Personalization of the Segment Anything Model. For specific visual concepts, such as your favorite dog, the authors tailor the Segment Anything Model (SAM). They provide two efficient solutions using only one-shot data: a training-free PerSAM and a fine-tuning variant, PerSAM-F. The images shown here come from DreamBooth.

However, SAM cannot segment specific visual concepts out of the box. Imagine wanting to crop your adorable pet dog out of a photo album, or remove the clock from a shot of your bedroom. Using the vanilla SAM model would take considerable time and effort: for every image, you must locate the target object in its particular pose or scene before activating SAM with precise prompts for segmentation. This raises the question of whether SAM can be quickly customized to segment unique visual concepts. To this end, researchers from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, Tencent Youtu Lab, and CFCS, School of CS, Peking University propose PerSAM, a customization strategy for the Segment Anything Model that requires no training. Using only one-shot data, a user-provided image and a rough mask denoting the personal concept, their technique efficiently customizes SAM.

They present three techniques to unlock the personalization potential of SAM's decoder while processing the test image. To be more precise, they first encode an embedding of the target object in the reference image using SAM's image encoder and the supplied mask. They then compute the feature similarity between the object and every pixel in the new test image. The estimated similarity map guides each token-to-image cross-attention layer in the SAM decoder. Additionally, two points are chosen as a positive-negative pair and encoded as prompt tokens, providing SAM with a location prior.
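The location prior described above can be sketched as a cosine-similarity map between the target embedding and the test-image features, from which the highest- and lowest-similarity pixels serve as the positive and negative point prompts. This is a minimal illustrative sketch, not the paper's exact implementation; the function names and shapes are assumptions.

```python
import numpy as np

def similarity_map(target_embed, test_feats):
    """Cosine similarity between a target embedding (C,) and
    test-image features (C, H, W) -- a sketch of PerSAM's location prior."""
    t = target_embed / (np.linalg.norm(target_embed) + 1e-8)
    f = test_feats / (np.linalg.norm(test_feats, axis=0, keepdims=True) + 1e-8)
    # Dot product of the normalized embedding with every pixel's feature.
    return np.einsum("c,chw->hw", t, f)

def pick_point_prompts(sim):
    """Highest-similarity pixel -> positive point, lowest -> negative point."""
    pos = np.unravel_index(np.argmax(sim), sim.shape)
    neg = np.unravel_index(np.argmin(sim), sim.shape)
    return pos, neg
```

In a full pipeline these two points would be encoded by SAM's prompt encoder, while the similarity map modulates the decoder's cross-attention.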

As a result, the prompt tokens are forced to concentrate on foreground target regions for effective feature interaction. The three techniques are:

• Target-guided Attention

• Target-semantic Prompting

• Cascaded Post-refinement

For sharper segmentation results, they implement a two-step cascaded post-refinement technique, using SAM to progressively improve the produced mask. This adds only around 100 ms to the process.
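The cascade can be sketched as follows: the coarse mask from the point prompts is fed back to SAM as a mask prompt, and its bounding box is then added as an extra prompt for a final pass. Here `sam_predict` is a hypothetical stand-in for a SAM predictor call, not the official API.

```python
def cascaded_refine(sam_predict, points, labels):
    """Two-step cascaded post-refinement (sketch).

    `sam_predict` is assumed to return (binary_mask, mask_logits) and to
    accept optional `mask_input` and `box` prompts -- an illustrative
    interface, not segment-anything's exact signature."""
    # Step 0: coarse mask from the positive/negative point prompts.
    coarse, logits = sam_predict(points=points, labels=labels)
    # Step 1: feed the coarse logits back as a mask prompt.
    refined, logits = sam_predict(points=points, labels=labels, mask_input=logits)
    # Step 2: add the refined mask's bounding box as a further prompt.
    ys, xs = refined.nonzero()
    box = (xs.min(), ys.min(), xs.max(), ys.max())
    final, _ = sam_predict(points=points, labels=labels, mask_input=logits, box=box)
    return final
```

Because each pass reuses SAM's lightweight decoder rather than the heavy image encoder, the extra cost stays small, consistent with the roughly 100 ms overhead reported above.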

As shown in Figure 2, with the designs above, PerSAM exhibits good personalized segmentation performance for a single subject across a range of poses or scenes. However, failure cases can occasionally arise when the subject has hierarchical structures to be segmented, such as the top of a container, the head of a toy robot, or a cap on top of a teddy bear.

Figure 2. Personalization Examples of Our Approach. The training-free PerSAM (Left) customizes SAM to segment user-provided objects in any poses or scenes with favorable performance. On top of this, PerSAM-F (Right) further enhances the segmentation accuracy by efficiently fine-tuning only 2 parameters within 10 seconds.

Since SAM may accept both the local part and the global shape as valid masks at the pixel level, this ambiguity makes it difficult for PerSAM to choose the right scale for its segmentation output. To ease this, they also present PerSAM-F, a fine-tuning variant of their method. Freezing the entire SAM to preserve its pre-trained knowledge, they fine-tune only two parameters within 10 seconds. Specifically, they let SAM produce several segmentation results at different mask scales, then adopt a learnable relative weight for each scale and take the weighted summation as the final mask output, adaptively selecting the optimal scale for different objects.

As can be seen in Figure 2 (Right), PerSAM-F displays improved segmentation accuracy thanks to this efficient one-shot training. Weighting multi-scale masks, rather than prompt tuning or adapters, effectively resolves the ambiguity problem. They also note that their method can help DreamBooth better fine-tune Stable Diffusion for personalized text-to-image generation. DreamBooth and related works take a small set of photos of a particular visual concept, like your favorite cat, and turn it into an identifier in the word embedding space that is subsequently used to represent the target object in the prompt. However, the identifier also absorbs visual details of the provided photographs’ backgrounds, such as stairs.

This would override the new backgrounds in the generated images and disturb the representation learning of the target object. Therefore, they propose to leverage PerSAM to segment the target object efficiently and supervise Stable Diffusion only on the foreground area of the few-shot images, enabling more diverse and higher-fidelity synthesis. They summarize the contributions of their paper as follows:
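Restricting the supervision to the foreground can be sketched as masking the standard diffusion reconstruction loss with the PerSAM mask, so background pixels contribute no gradient. This is a minimal sketch of the idea, not the paper's training code; the function name and tensor shapes are assumptions.

```python
import numpy as np

def masked_mse(pred_noise, true_noise, fg_mask):
    """Foreground-only training loss (sketch): the usual diffusion MSE is
    computed only inside the segmentation mask, so background details of
    the few-shot photos do not leak into the learned identifier."""
    err = (pred_noise - true_noise) ** 2
    m = fg_mask.astype(float)
    # Average the error over foreground pixels only.
    return (err * m).sum() / (m.sum() + 1e-8)
```

With this loss, a large prediction error on a background pixel (e.g., the stairs) leaves the objective unchanged, while foreground errors are penalized as usual.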

• Personalized Segmentation Task. From a new standpoint, they investigate how to customize segmentation foundation models into personalized scenarios with minimal expense, i.e., from general to private purposes. 

• Efficient Adaptation of SAM. They investigate for the first time how to adapt SAM for downstream applications by tuning only two parameters, and they present two simple solutions: PerSAM and PerSAM-F. 

• Evaluation of Personalization. They annotate PerSeg, a brand-new segmentation dataset containing numerous categories in various contexts. Additionally, they test their strategy on efficient video object segmentation. 

• Improved Stable Diffusion Personalization. The segmentation of the target item in the few-shot photos reduces background noise and enhances DreamBooth’s ability to generate custom content.

Check out the Paper and Code.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
