UT Austin and Sony AI’s VIOLA object-centric imitation learning method for robot manipulation outperforms SOTA methods by 45.8%


Vision-based manipulation is a key skill that enables autonomous robots to understand their environment and derive intelligent behaviors from it. Deep imitation learning has recently emerged as a promising training method for vision-based manipulation, and although the resulting models perform well at mapping raw visual observations to motor actions, they are not robust to covariate shift or environmental disturbances, which results in poor generalization to new situations.

In the new paper VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors, researchers from the University of Texas at Austin and Sony AI present VIOLA (Visuomotor Imitation via Object-centric LeArning), an object-centric imitation learning model that brings awareness of objects and their interactions to imitation learning. The new approach improves the robustness of vision-based robotic manipulation and outperforms state-of-the-art imitation learning methods by 45.8%.

VIOLA is designed to learn effective closed-loop visuomotor policies for robot manipulation and was inspired by the idea that explaining visual scenes in terms of multiple objects and their interactions could enable models to make faster and more precise predictions. The proposed method thus decomposes visual scenes into factorized object representations to encourage robots to reason about the manipulation workspace in a modular way and improve their ability to generalize.

VIOLA first uses a pre-trained region proposal network (RPN) to obtain a set of general object proposals from raw visual observations, then extracts features from each of these proposals to learn object-centric factorized representations of the visual scene. Finally, a transformer-based policy leverages a multi-head self-attention mechanism to identify task-relevant regions and improve the robustness and efficiency of the imitation learning process.
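The core of the policy step described above can be sketched in a few lines: treat each object proposal's feature vector as a token and run multi-head self-attention over the set, so the policy can weigh task-relevant proposals against one another. The sketch below is a minimal, self-contained NumPy illustration of that attention mechanism, not the authors' implementation; the token count, feature dimension, and random projection matrices are all stand-in assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, Wq, Wk, Wv, num_heads):
    """tokens: (K, d) per-proposal features; Wq/Wk/Wv: (d, d) projections."""
    K, d = tokens.shape
    dh = d // num_heads
    # Project to queries/keys/values and split into heads: (H, K, dh).
    q = (tokens @ Wq).reshape(K, num_heads, dh).transpose(1, 0, 2)
    k = (tokens @ Wk).reshape(K, num_heads, dh).transpose(1, 0, 2)
    v = (tokens @ Wv).reshape(K, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention: each proposal attends to all others.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)  # (H, K, K)
    # Merge heads back into (K, d) contextualized proposal features.
    out = (attn @ v).transpose(1, 0, 2).reshape(K, d)
    return out, attn

rng = np.random.default_rng(0)
K, d, H = 5, 8, 2                      # hypothetical: 5 proposals, 8-dim features, 2 heads
tokens = rng.normal(size=(K, d))       # stand-ins for extracted proposal features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = multi_head_self_attention(tokens, Wq, Wk, Wv, H)
```

In a full policy, the contextualized features `out` would be pooled and fed to an action head; inspecting `attn` shows which proposals each head treats as task-relevant.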

The team compared VIOLA to state-of-the-art deep imitation learning methods on vision-based manipulation tasks using a real robot. In evaluations, VIOLA exceeded the success rate of the best state-of-the-art baseline by 45.8% and maintained its robustness on precise grasping and manipulation tasks, even when visual variations such as perturbed camera views were introduced.

The team summarizes the contributions of their study as follows:

  1. We learn object-centric representations based on general object proposals and design a transformer-based policy that determines task-relevant proposals to generate the robot’s actions.
  2. We show that VIOLA outperforms state-of-the-art methods on simulation benchmarks and validate the effectiveness of our model designs through ablation studies.
  3. We show that VIOLA learns policies on a real robot to perform challenging tasks.

Videos and model details are available on the project website: https://ut-austin-rpl.github.io/VIOLA. The paper VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
