Unlocking Precision in Text-Guided Image and 3D Scene Editing: Meet ‘Watch Your Steps’

Neural radiance fields (NeRFs) are growing rapidly in popularity thanks to their ability to produce accurate and intuitive visualizations, and this has led to interest in editing NeRFs to change the scenes they represent. Denoising diffusion models, meanwhile, produce remarkably good images from textual descriptions and have become a popular tool for image editing. Despite the promise of diffusion-based image editing, an automated way to identify the regions that need modification is conspicuously lacking. Current methods either rely on user-provided masks, use the global information found in noisy inputs as a starting point, or depend on the input data to determine how the denoising process is carried out.

However, these approaches tend to over-edit. Even Instruct-NeRF2NeRF (IN2N), which applies InstructPix2Pix (IP2P) to NeRF editing, runs into problems with excessive scene editing. Similar to IP2P, DiffEdit uses caption-guided noise predictions to locate edit regions, but it is slower and less efficient. A team of researchers has now presented an approach for identifying and localizing the precise region of an image that needs to change in order to follow a particular text instruction. The work is titled Watch Your Steps: Local Image and Scene Editing by Text Instructions.

Working with the InstructPix2Pix (IP2P) model, the team looked at the discrepancy between the predictions IP2P makes with the instruction and the predictions it makes without it. They call this difference the relevance map. The relevance map indicates how important it is to change each pixel in order to achieve the requested edit. It serves as a guide for the edit, so that only the relevant pixels are changed while the rest of the image is left untouched.
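To make the idea concrete, here is a minimal NumPy sketch of how such a relevance map could be computed from two noise predictions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the channel averaging, the percentile clipping, and the 0.5 threshold are all hypothetical choices, and the two input arrays stand in for IP2P predictions that would come from a real diffusion model.

```python
import numpy as np


def relevance_map(eps_with, eps_without, clip_percentile=99.0):
    """Per-pixel relevance from two noise predictions of shape (C, H, W).

    eps_with:    prediction conditioned on the edit instruction (assumed input)
    eps_without: prediction made without the instruction (assumed input)
    The percentile clipping is an assumption, used here to tame outliers.
    """
    # Absolute difference, averaged over channels -> (H, W)
    diff = np.abs(eps_with - eps_without).mean(axis=0)

    # Clip extreme values so a few outliers do not dominate the scaling
    cap = np.percentile(diff, clip_percentile)
    diff = np.clip(diff, 0.0, cap)

    # Min-max normalize to [0, 1]; return all zeros if nothing differs
    span = diff.max() - diff.min()
    if span == 0:
        return np.zeros_like(diff)
    return (diff - diff.min()) / span


def edit_mask(relevance, threshold=0.5):
    """Binary mask of pixels considered relevant to the edit (hypothetical threshold)."""
    return relevance >= threshold
```

Pixels where the two predictions agree get a relevance near zero and fall outside the mask, which is what keeps an edit from spilling into unrelated parts of the image.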

According to the team, relevance maps are useful beyond basic image editing: they also improve the accuracy of text-guided edits to 3D scenes, especially scenes modeled by neural radiance fields. To do this, a relevance field is trained on the relevance maps of the different training views. The relevance field defines the 3D region that should be altered to achieve the intended edit, and relevance maps rendered from it are then used to guide the iterative updating of the training views.
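The masked update of a single training view can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's procedure: `update_training_view`, its arguments, and the 0.5 threshold are hypothetical, and the rendered relevance map is assumed to be supplied by an already trained relevance field.

```python
import numpy as np


def update_training_view(original, edited, rendered_relevance, threshold=0.5):
    """Blend an edited view into the original using a rendered relevance map.

    original, edited:   images of shape (H, W, 3)
    rendered_relevance: map of shape (H, W) in [0, 1], assumed to be rendered
                        from the relevance field for this camera view
    """
    # Broadcast the (H, W) mask over the color channels
    mask = (rendered_relevance >= threshold)[..., None]

    # Take edited pixels inside the relevant region, original pixels elsewhere
    return np.where(mask, edited, original)
```

Because the mask is rendered from one shared 3D field instead of being estimated separately per image, the region selected for editing stays consistent from one view to the next.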

In the team's evaluation, the method achieved state-of-the-art performance on both NeRF editing and image editing tasks, which suggests that localizing edits with relevance maps is an effective way to address the over-editing problem in images and scenes.


Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
