MIT And IBM Team Developed A Neural Network Pipeline, Called ‘RISP’, That Can Capture The Characteristics Of A Physical System’s Dynamic Motion From Video


Please Don't Forget To Join Our ML Subreddit

Recording the movement of objects or people is known as motion capture (mocap). The technology was developed for gait analysis in the life sciences industry. It is now widely used by VFX studios, sports therapists, neuroscientists, and for computer vision and robotics validation and control, allowing engineers to interpret and mimic action in real-world contexts.

External hardware platforms are required for traditional solutions, which are prohibitively expensive. Recent advancements in differentiable simulation and rendering offer a low-cost and appealing alternative. However, they usually assume that the videos are from a well-known renderer. Due to the disparity between rendering and real-world films, such an assumption restricts their utility in inferring dynamic information from an unknown rendering domain, which is typical in real-world applications.

Researchers are attempting to shift the responsibility to neural networks, which could extract this information from a short movie and duplicate it in a model. Because it can characterize realistic, continuous, dynamic motion from images and transform back and forth between a 2D render and a 3D scene in the world, work in physics simulations and rendering offers promise in making this more commonly used. On the other hand, current techniques necessitate exact information about the environmental parameters in which the action takes place and the selection of a renderer, both of which are frequently unavailable.

MIT and IBM researchers devised a trained neural network pipeline to combat these difficulties. Their paper, “RISP: RENDERING-INVARIANT STATE PREDICTOR WITH DIFFERENTIABLE SIMULATION AND RENDERING FOR CROSS-DOMAIN PARAMETER ESTIMATION,” shows how their approach can infer the state of the environment and the actions taking place. It can also infer the physical characteristics of the object or person of interest (system) and its control parameters. The results reveal that this strategy outperforms previous methods in simulations of four physical systems under various environmental variables. In addition, the technology enables imitation learning, which is the ability to predict and reproduce the trajectory of a real-world flying quadrotor from a video.


In their work, the researchers stress the importance of ignoring the rendering differences in video clips and focusing on the fundamental information about the dynamic system or dynamic motion. The lighting circumstances, backdrop information, texture information, and other factors influence the photographs or films. In a real-world context, these are not always measurable. Because the lighting conditions are so variable, you’ll obtain dramatically different video clips if you take a video of a leopard sprinting in the morning and the evening. But what matters most is the leopard’s dynamic motion: its joint angles, not whether they appear bright or dark.

The team created a pipeline system with a neural network called the “rendering invariant state-prediction (RISP)” network to solve the problem of rendering domains and picture discrepancies. RISP converts variations in picture (pixel) differences to differences in system states — i.e., the action environment. This makes their method generalizable and independent of rendering setups. Random rendering parameters and states are given into a differentiable renderer, a form of the renderer that assesses the sensitivity of pixels to rendering configurations to train RISP. This generates a variety of images and videos using known ground-truth parameters, allowing RISP to reverse the process and forecast the state of the environment using the supplied video.

RISP’s rendering gradients were also reduced, making its predictions less sensitive to changes in rendering configurations, allowing it to learn to ignore visual appearances and focus on learning dynamical states. A differentiable renderer makes this possible.

The approach then employs two parallel pipelines that are comparable to one another. The first is for the source domain, which has well-defined variables. A differentiable simulation is used to enter system parameters and operations. The states of the created simulation are merged with various rendering setups in a differentiable renderer to produce visuals supplied into RISP. RISP then generates predictions regarding the status of the environment.

Simultaneously, a target domain pipeline with unknown variables is executed. These output images are sent to RISP in this pipeline, which generates a predicted state. A new loss is generated when the projected states from the source and target domains are compared; this difference is utilized to alter and improve some of the parameters in the source domain pipeline. This process can then be repeated, lowering the loss between the pipelines even further.

The team put their method to the test in four different simulated systems: 

  1. A Quadrotor: a rigid flying body with no physical touch
  2. A Cube: a rigid body that interacts with its environment
  3. An articulated hand
  4. A-Rod: a deformable body that can move like a snake. 

They also constructed baselines and an oracle to compare the innovative RISP process in these systems to similar approaches. This includes scenarios where the methods do not include the rendering gradient loss, do not train a neural network with any loss, or do not include the RISP neural network at all. The researchers also looked at how the gradient loss affected the performance of the state prediction model over time. Finally, the researchers used their RISP system to extract motion from footage of a real-world quadrotor with complex dynamics. They compared the results to other strategies that didn’t use a loss function and relied on pixel disparities and one that required the manual setting of a renderer.

The results show that the RISP beat similar or state-of-the-art methods available in nearly all studies, imitating or duplicating the necessary parameters or motion. It is a data-efficient and generalizable competitor to current motion capture systems.

The researchers established two key assumptions for this study: 

  1. Information about the camera, such as its position and settings, as well as the geometry
  2. Physics regulating the item or person being monitored are both known. 

They plan to address these in their future work. 



Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring the new advancements in technologies and their real-life application.