Speaking robot: Our new AI model translates vision and language into robotic actions (2024)

For decades, when people have imagined the distant future, they’ve almost always included a starring role for robots. Robots have been cast as dependable, helpful and even charming. Yet across those same decades, the technology has remained elusive — stuck in the imagined realm of science fiction.

Today, we’re introducing a new advancement in robotics that brings us closer to a future of helpful robots. Robotics Transformer 2, or RT-2, is a first-of-its-kind vision-language-action (VLA) model. A Transformer-based model trained on text and images from the web, RT-2 can directly output robotic actions. Just as language models are trained on text from the web to learn general ideas and concepts, RT-2 transfers knowledge from web data to inform robot behavior.

The real-world challenges of robot learning

The pursuit of helpful robots has always been a herculean effort, because a robot capable of doing general tasks in the world needs to be able to handle complex, abstract tasks in highly variable environments — especially ones it's never seen before.

Unlike chatbots, robots need “grounding” in the real world and their abilities. Their training isn’t just about, say, learning everything there is to know about an apple: how it grows, its physical properties, or even that one purportedly landed on Sir Isaac Newton’s head. A robot needs to be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up.

That’s historically required training robots on billions of data points, firsthand, across every single object, environment, task and situation in the physical world, a prospect so time-consuming and costly as to make it impractical for innovators. Learning is a challenging endeavor, and even more so for robots.

A new approach with RT-2

Recent work has improved robots’ ability to reason, even enabling them to use chain-of-thought prompting, a way to dissect multi-step problems. The introduction of vision-language models, like PaLM-E, helped robots make better sense of their surroundings. And RT-1 showed that Transformers, known for their ability to generalize information across systems, could even help different types of robots learn from each other.

But until now, robots ran on complex stacks of systems, with high-level reasoning and low-level manipulation systems playing an imperfect game of telephone to operate the robot. Imagine thinking about what you want to do, and then having to tell those actions to the rest of your body to get it to move. RT-2 removes that complexity and enables a single model to not only perform the complex reasoning seen in foundation models, but also output robot actions. Most importantly, it shows that with a small amount of robot training data, the system is able to transfer concepts embedded in its language and vision training data to direct robot actions — even for tasks it’s never been trained to do.
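One way to picture how a single model can both reason and "output robot actions" is to treat each action as a short sequence of discrete tokens, the same kind of output a language model already produces. The sketch below is illustrative only: the bin count, the value range, the layout in `ACTION_DIMS` and all function names are assumptions made for this example, not details given in this post and not RT-2's actual interface.

```python
from typing import List, Sequence

# Hypothetical settings for this sketch (not taken from the post).
NUM_BINS = 256                 # discrete levels per action dimension
LOW, HIGH = -1.0, 1.0          # assumed normalized range of each dimension
ACTION_DIMS = ("dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper")


def encode_action(action: Sequence[float]) -> List[int]:
    """Map continuous action values to integer tokens in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)          # keep value inside the range
        fraction = (clipped - LOW) / (HIGH - LOW)     # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def decode_action(tokens: Sequence[int]) -> List[float]:
    """Map integer tokens back to the center of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


def to_text(tokens: Sequence[int]) -> str:
    """Render tokens as a space-separated string, as a text model might emit."""
    return " ".join(str(t) for t in tokens)


def from_text(text: str) -> List[int]:
    """Parse a model's text output back into validated integer tokens."""
    tokens = [int(piece) for piece in text.split()]
    if len(tokens) != len(ACTION_DIMS):
        raise ValueError(f"expected {len(ACTION_DIMS)} tokens, got {len(tokens)}")
    if any(not 0 <= t < NUM_BINS for t in tokens):
        raise ValueError("token out of range")
    return tokens
```

Under these assumptions, a controller would call `decode_action(from_text(model_output))` on each step and send the resulting values to the arm. The appeal of this framing is that the same model vocabulary carries both language and motion, so no separate hand-off layer is needed between "deciding" and "moving".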

For example, if you wanted previous systems to be able to throw away a piece of trash, you would have to explicitly train them to identify trash, as well as to pick it up and throw it away. Because RT-2 is able to transfer knowledge from a large corpus of web data, it already has an idea of what trash is and can identify it without explicit training. It also has an idea of how to throw away the trash, even though it’s never been trained to take that action. And think about the abstract nature of trash: a bag of chips or a banana peel only becomes trash once you’ve eaten the contents. RT-2 is able to make sense of that from its vision-language training data and do the job.

A brighter future for robotics

RT-2’s ability to transfer information to actions shows promise for robots to more rapidly adapt to novel situations and environments. In testing RT-2 models in more than 6,000 robotic trials, the team found that RT-2 functioned as well as our previous model, RT-1, on tasks in its training data, or “seen” tasks. And it almost doubled its performance on novel, unseen scenarios to 62% from RT-1’s 32%.

In other words, with RT-2, robots are able to learn more like we do — transferring learned concepts to new situations.

Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots. While there is still a tremendous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us an exciting future for robotics just within grasp.

Check out the full story on the Google DeepMind Blog.