Meet ‘DreamFusion,’ An Effective AI Technique That Uses Machine Learning To Synthesize 3D Models From Text Prompts

By prompting a text-to-image model we can generate images of a wide variety of objects. With clever prompting, it’s also possible to synthesize different perspectives of a specific object:

“A cat as seen from the back” and “A cat as seen from the front”. Generated with Stable Diffusion.

The question researchers at Google posed is: can we integrate these perspectives into a single, coherent 3D object?


Consider a ray-casting camera that can only see through space along straight lines (rays), up to a given distance. For every point sampled along a ray’s path, we want to know whether an object is present there (its volume density) and what color it has at that location.
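A ray like this can be sketched in a few lines. The following snippet (a minimal NumPy illustration; the `near`/`far` bounds and sample count are arbitrary choices, not values from the paper) generates evenly spaced sample points along a single ray:

```python
import numpy as np

def sample_ray(origin, direction, near=0.1, far=5.0, n_samples=64):
    """Sample evenly spaced 3D points along the ray r(t) = origin + t * direction."""
    direction = direction / np.linalg.norm(direction)  # make the direction a unit vector
    t = np.linspace(near, far, n_samples)              # distances along the ray
    points = origin[None, :] + t[:, None] * direction[None, :]
    return t, points                                   # shapes: (n_samples,), (n_samples, 3)

# A ray starting at the origin, looking down the z-axis:
t, pts = sample_ray(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
```

These sample points are where the scene is queried for density and color.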

When an object is in the ray’s path, the volume density σ changes.

A ray can be uniquely identified by its origin (an x, y, z coordinate) and its direction (pitch and yaw). In NeRFs, a neural network (usually an MLP) is trained to predict densities and RGB colors along the ray; repeating this for randomly sampled rays yields a neural representation that encodes the 3D scene.
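The network’s job can be sketched as a function from a ray sample to a density and a color. Below is a toy two-layer MLP with untrained random weights (a stand-in for the real NeRF network, which is larger and uses positional encodings the paper describes); the input here is a hypothetical 5D vector of (x, y, z, pitch, yaw):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer MLP standing in for the NeRF network: random untrained
# weights mapping a 5D ray sample (x, y, z, pitch, yaw) to (sigma, r, g, b).
W1 = rng.normal(size=(5, 64)) * 0.1
b1 = np.zeros(64)
W2 = rng.normal(size=(64, 4)) * 0.1
b2 = np.zeros(4)

def query_field(sample):
    h = np.maximum(sample @ W1 + b1, 0.0)   # ReLU hidden layer
    out = h @ W2 + b2
    sigma = np.log1p(np.exp(out[0]))        # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))    # sigmoid keeps colors in [0, 1]
    return sigma, rgb

sigma, rgb = query_field(np.array([0.1, 0.2, 0.3, 0.0, 0.0]))
```

Training adjusts the weights so that these outputs match the observed scene.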

Training NeRFs


We can use 2D images coupled with their relative camera positions (usually obtained via photogrammetry) as ground truth for every pixel of every image. To train the network, we ask it to predict densities and colors along the ray corresponding to a given pixel, integrate them over the ray’s entire length, and compute the loss between this rendered color and the pixel’s actual color.
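The integration step uses the standard NeRF volume-rendering quadrature: each sample contributes its color weighted by its opacity and by the transmittance (how much light survives the samples in front of it). A minimal NumPy version:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite densities and colors along a ray into one pixel color.

    Standard NeRF quadrature:
        alpha_i = 1 - exp(-sigma_i * delta_i)        (opacity of sample i)
        T_i     = prod_{j<i} (1 - alpha_j)           (transmittance reaching i)
        C       = sum_i T_i * alpha_i * color_i      (rendered pixel color)
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0), weights

# One dense red sample in the middle of an otherwise empty ray:
sigmas = np.array([0.0, 50.0, 0.0])
colors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
pixel, weights = render_ray(sigmas, colors, deltas=np.full(3, 0.1))
# pixel comes out nearly pure red; anything behind the dense sample is occluded.
```

The loss is then just the difference between `pixel` and the corresponding ground-truth pixel, backpropagated through the network.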

Once trained, we can extract a 3D model from the network by ray-casting in a grid pattern to obtain a voxel map of densities, then running the marching cubes algorithm to get a good approximation of the underlying 3D surface.
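The extraction step can be sketched as follows. Here a toy analytic density field (a solid sphere) stands in for querying the trained NeRF at every grid point; in practice the thresholded grid would be handed to a marching cubes implementation (e.g. `skimage.measure.marching_cubes`) to produce a triangle mesh:

```python
import numpy as np

# Sample a toy density field (a solid sphere of radius 0.5) on a regular
# grid, standing in for querying the trained NeRF at each grid point.
n = 32
axis = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
density = np.where(x**2 + y**2 + z**2 <= 0.25, 10.0, 0.0)

# Threshold the densities into an occupancy voxel map; marching cubes would
# then extract the level surface of this grid as a mesh.
occupied = density > 5.0
print(occupied.sum(), "of", occupied.size, "voxels occupied")
```

The grid resolution trades off mesh fidelity against the number of network queries.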

A 2D visualization of the marching cubes algorithm.

DreamFusion: Training NeRFs using Imagen


To use text to generate 3D objects, we start by feeding a caption to a text-to-image model such as Imagen, tweaking the prompt slightly depending on the randomly sampled camera viewpoint we want to generate, e.g. “front view,” “top view,” or “side view.”
We then feed the same camera position and angle parameters to an untrained NeRF model, which renders an initial image of the object. Noise is added to this render, and the pre-trained text-to-image model, guided by our caption, denoises it into a higher-quality image. The difference between the denoised result and the original render is used to update the NeRF, so that the model only learns from the parts of the image the diffusion model improved. The process is repeated until the 3D model is satisfactory, after which it can be exported using the grid ray-cast + marching cubes procedure.
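The shape of this loop can be illustrated with a heavily simplified toy: here the “render” is an image optimized directly (standing in for the NeRF and its renderer), and a hypothetical `toy_denoiser` that nudges noisy renders toward a fixed target image stands in for caption-conditioned Imagen. None of the constants below come from the paper; only the loop structure (add noise, denoise, update from the difference between predicted and true noise) mirrors the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.full((8, 8, 3), 0.8)   # hypothetical "improved" image the denoiser prefers
image = np.zeros((8, 8, 3))        # current render, optimized in place
alpha_bar, lr = 0.7, 0.3           # toy noise level and step size

def toy_denoiser(noisy, eps):
    # Stand-in for the diffusion model: predicts the added noise, biased
    # toward the target image. The gap between predicted and true noise
    # carries the "improvement" signal used to update the render.
    return eps + np.sqrt(1 - alpha_bar) * (noisy / np.sqrt(alpha_bar) - target)

for _ in range(50):
    eps = rng.normal(size=image.shape)                                  # add noise...
    noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * eps
    grad = toy_denoiser(noisy, eps) - eps                               # ...denoise...
    image -= lr * grad                                                  # ...update render

# After the loop, the render has drifted toward the denoiser's preference.
```

In DreamFusion this gradient is pushed through the NeRF’s parameters rather than through raw pixels, and the real denoiser is Imagen conditioned on the caption.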

This article is written as a research summary by Marktechpost staff based on the research paper ‘DreamFusion: Text-to-3D using 2D Diffusion’. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.


Martino Russi is an ML engineer who holds a master’s degree in AI from the University of Sussex. He has a keen interest in reinforcement learning, computer vision, and human-computer interaction. He is currently researching unsupervised reinforcement learning and developing low-cost, high-dimensional control interfaces for robotics.
