Google AI Introduces Robotics Transformer 1 (RT-1), A Multi-Task Model That Tokenizes Robot Inputs And Outputs Actions To Enable Efficient Inference At Runtime (2024)

0 Shares

The primary source of the most recent technological advancements we see today in numerous machine learning subfields is the knowledge transfer that occurs from large task-agnostic datasets to expressive models that can effectively absorb all this data. This capability has been demonstrated remarkably previously when it comes to domains like computer vision, natural language processing, and speech recognition. However, its application still remains undetermined when it comes to robotics. One of the major elements that contribute to this limitation is the absence of extensive and diverse robotic data, which restricts a model’s capacity to take in a wide range of robotic experiences. Moreover, another concern is the lack of scalable models and their capacity to generalize learning from such huge datasets.

Researchers from Google AI worked in this direction and detailed that a combination of open-ended task-agnostic training and high-capacity architecture capable of taking in all of the different robotic data is the key to success for general robotic models. To test their hypotheses, a team of researchers at Google AI created Robotics Transformer 1 (RT-1), a multi-task model that tokenizes robot inputs and outputs actions to facilitate effective inference at runtime and enable real-time control. This model was developed using a real-world robotics dataset of more than 130k episodes that was gathered using 13 robots from Everyday Robots (EDR) over an extended period.

RT-1’s main distinguishing characteristics are image tokenization, action tokenization, and token compression. The transformer architecture that underpins the design of RT-1 allows it to effectively generate tokenized actions from its inputs which include a brief history of images taken by the robot’s camera and task descriptions written in natural language. The input images are run through a model that is pre-trained on ImageNet during the image tokenization step, and the output is then flattened accordingly. The image tokenizer then employs FiLM layers to extract image features necessary for the task at hand. In order to learn with the attention module TokenLearner, the model adaptively chooses soft combinations of compressible image tokens. This is what results in an inference speed-up.

The researchers emphasized the need to have a large and diverse dataset of robot trajectories in order to develop such a system that could generalize to new tasks and demonstrate robustness to various distractors and backgrounds. The researchers used 13 EDR robots to collect 130k episodes over 17 months to create such a dataset. The dataset includes activities like picking and arranging objects, opening, and closing drawers, knocking things over, etc. Additionally, they added a written description of the robot’s action as an annotation for each episode.

The team assessed the generalization capabilities and performance of RT-1 against three baseline models in four categories: performance on known tasks, performance on unseen tasks, robustness, and long-horizon scenarios. In all four areas, RT-1 performs much better than baselines, showing significantly superior zero-shot generalization to novel tasks, environments, and objects. They also thoroughly examined the effects of tokenization, action representation, dataset composition, and numerous other design decisions that went into the model and training set.

In a nutshell, the RT-1 Robotics Transformer is a straightforward and scalable action-generation model appropriate for real-world robotics tasks. When it comes to future work, the researchers will focus on scaling the number of robot skills faster by creating techniques that even allows beginners to train the robot via guided data collection and model prompting. They anticipate that scalable attention and memory will enhance robotic transformers’ response times and retention ability. Google has also open-sourced the RT-1 code in hopes that it will prove to be a useful tool for upcoming research on scaling robot learning. The project’s website and other details can be accessed here.

Check out thePaperandBlog.All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to joinour Reddit pageanddiscord channel, where we share the latest AI research news, cool AI projects, and more.

Meet Hailo-8™: An AI Processor That Uses Computer Vision For Multi-Camera Multi-Person Re-Identification

Khushboo Gupta

+ posts

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.