Between two worlds


Robot design is calling on the virtual domain to make smarter moves in real life, as Chris Edwards discovers.

Machine learning and accelerated simulation are driving humanoid robotic developments Credit: Nvidia

The robots are back. And this time they look far more like Karel Čapek’s idea than the mechanical arms that represent much of the real world’s robot business. At least, those are the ones in the labs. Few see human-like robots as a near-term commercial success. But the research has seen some dramatic changes over the past couple of years, driven by a combination of machine learning and accelerated simulation.

Last year, Gartner put a timescale of at least a decade on humanoid droids making it to mainstream use. But the end of Nvidia CEO Jensen Huang’s keynote at March’s GTC conference saw a two-legged robot built by Disney Research waddle onto the stage, burbling away like its Star Wars counterparts, as a demonstration of how robots that look more like their sci-fi forerunners are appearing in real life. Or at least in theme parks.

It is little more than a walking battery with cute gestures and sound effects. And humans keep the BDX robots from getting into trouble by controlling them remotely. But they will react seemingly naturally to a head scratch. Greater autonomy will give them a wider range of gestures and interactions, though Disney expects humans to remain in control long term, if only to keep the robots in character.

The BDX is a low-touch robot. It follows machines like Aldebaran’s Pepper that were basically screens on wheels. And, as such, the BDX machines do not need to work out how to deal with one of the biggest obstacles that mainstream robots face today: manipulating things in a chaotic world.

Industrial robots have relied on static, unchanging objects sitting on a production line that a single arm can move around. The suction cups of Ambi Robotics’ stacking robot for logistics deal with regular cardboard boxes as though the machine is playing a 3D version of Tetris. But the trickier problem of having a robot move itself to a truck loading dock and use grippers to take stock in and out of a warehouse represents another level of complexity.

Then there is the sheer number of actuators a full humanoid robot needs. According to Moritz Bächer, associate lab director of Disney’s team in Zurich, with just two legs, a BDX robot needs 14 actuators. Each leg needs five degrees of freedom to waddle around, with the remaining motors used to move the neck joints. The designers chose not to put an additional motor in each ankle that would let the BDX perform more advanced steps, such as balancing on one leg. Going to full humanoid designs may mean using more than two dozen individual motors.

Dramatic progress

Getting the robot to combine its many actuators to do what the user wants presents the other obstacle. Yet this higher-level problem is where robot makers have seen dramatic progress. It is one reason why Nvidia has become a keen proponent of the robot market. The market now exercises not just one of the GPU’s selling points but two: machine learning and simulation.

Traditionally, simulation provided a way to help designers tune algorithmic limb-control strategies, such as model predictive control, before risking damage in the real world. That is giving way to machine learning as variants of the large language models that underpin today’s chatbots move in.

At the Actuate robotics conference last autumn, Brad Porter, CEO and founder of Collaborative Robots, said: “If you had asked five years ago is it more likely that we will have robots that could clear the dishes from a table or is it more likely that we would have AI that understood semantically what it meant, I would have said semantically understanding that query would have been much harder than actually manipulating and clearing the table.”

Multimodal or visual language models coupled with reinforcement learning have changed how many R&D teams approach the problem. The training regime has made it possible to take another mainstay of the modern entertainment business and apply it to machine control: motion capture.

Above: Two-legged robots built by Disney Research waddle onto the stage at Nvidia's GTC conference Credit: Nvidia

Demonstrations by Chinese company Unitree of its G1 robot performing backflips and martial arts moves stem from training an AI model to replicate moves captured by cameras and other sensors and converted into kinematic models that run in a simulator. Disney’s team has used a similar approach to get its robots to try to balance on one leg, even though they lack the ankle motor that would make the manoeuvre more feasible.
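
The conversion from captured motion into joint commands can be illustrated with a toy example. The sketch below assumes a planar two-link leg with invented link lengths; real pipelines retarget full 3D skeletons from noisy marker data, but the core step is the same: turn a recorded end-point trajectory into joint-angle targets a simulator or robot can track.

```python
import math

# Minimal sketch of one step in a mocap-to-simulation pipeline (assumption:
# a planar two-link leg; real pipelines fit full 3-D skeletons).
L1, L2 = 0.40, 0.38   # thigh and shin lengths in metres (illustrative)

def leg_ik(x, y):
    """Closed-form inverse kinematics: foot target (x, y) relative to the
    hip -> (hip_angle, knee_angle) in radians."""
    d2 = x * x + y * y
    if math.sqrt(d2) > L1 + L2 or math.sqrt(d2) < abs(L1 - L2):
        raise ValueError("target out of reach")
    # Law of cosines gives the knee bend...
    cos_knee = (d2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    knee = math.acos(max(-1.0, min(1.0, cos_knee)))
    # ...and the hip angle points the leg at the target, corrected for the bend.
    hip = math.atan2(y, x) - math.atan2(L2 * math.sin(knee),
                                        L1 + L2 * math.cos(knee))
    return hip, knee

# A captured foot trajectory becomes a stream of joint-angle targets:
trajectory = [(0.1, -0.7), (0.15, -0.65), (0.2, -0.7)]
targets = [leg_ik(x, y) for (x, y) in trajectory]
```

Feeding those angle targets to the simulated robot’s actuators, frame by frame, is what lets a learned policy imitate the captured motion.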

Such techniques demand a lot of data, much of which comes from logs collected during field trials and operations. “It’s what we call the data flywheel. All our working systems are sending new data, real data,” says Ken Goldberg, Ambi’s chief scientist.

Ambi has collected some 200,000 hours of camera recordings from its warehouse transport and parcel stacking robots, which are then recycled for further training. But Goldberg regards this quantity as tiny compared with the huge corpus of text that OpenAI and others use to train their language models.

Simulation, the other target for GPUs, can provide far more training data by synthesising it, and potentially do everything a lot faster: hardware acceleration delivers faster-than-real-time reinforcement-learning cycles. Using Nvidia’s Isaac Sim engine, Disney has been training several thousand virtual robots in parallel to cut training time.
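
The payoff of running thousands of robots at once is that every policy update is informed by thousands of rollouts per wall-clock step. A toy sketch of the idea follows; all numbers and dynamics are invented, each “robot” is a one-dimensional stabilisation task rather than a physics model, and an evolution-strategies-style update stands in for the full reinforcement-learning machinery. What it shows is the batching pattern: states for all environments live in one array and are stepped together.

```python
import numpy as np

# Toy stand-in for thousands of parallel simulated robots (assumption:
# each "robot" is a 1-D stabilisation task, not a full physics model).
N_ENVS = 4096          # parallel virtual robots
STEPS = 20             # rollout length per training iteration
rng = np.random.default_rng(0)

def rollout_returns(gains):
    """Simulate all environments at once; gains has shape (N_ENVS,)."""
    x = rng.uniform(-1.0, 1.0, size=N_ENVS)   # one initial state per env
    total = np.zeros(N_ENVS)
    for _ in range(STEPS):
        a = -gains * x                        # linear feedback policy
        x = x + 0.1 * a                       # trivial dynamics
        total += -x ** 2                      # reward: stay near zero
    return total

# Evolution-strategies-style update: perturb the policy parameter in every
# env, then move towards perturbations that scored above average.
k, sigma, lr = 0.0, 0.5, 0.05
for _ in range(50):
    noise = rng.normal(0.0, sigma, size=N_ENVS)
    returns = rollout_returns(k + noise)
    advantage = returns - returns.mean()
    k += lr * (advantage @ noise) / (N_ENVS * sigma ** 2)

print(f"learned gain k = {k:.2f}")   # a positive gain damps the state
```

The same structure scales up: replace the scalar gain with neural-network weights and the toy dynamics with a GPU physics engine, and the per-robot cost of each learning step shrinks with the batch size.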

Simulation translation challenges

How well the simulation translates to the real world depends, naturally, on the fidelity of the model. At GTC, Ken Goldberg, chief scientist of Ambi and professor of engineering at the University of California, Berkeley, said simulation can transfer well into the real world, particularly for airborne drones. But rough or sandy ground, ice and wet pavements present bigger challenges. It is one reason the Newton project emphasises real-world physics. The animation of Blue, the Disney robot, that led up to the physical machine appearing on stage at GTC showed it wading through a desert of rough gravel.

Disney, Google DeepMind and Nvidia are working on a physics-driven simulation environment called Newton that could speed up robot development by doing far more in the virtual world before transferring the skills obtained into real-world machines. “We need a physics engine that is designed for very fine-grained rigid and soft bodies and designed to train tactile feedback and fine motor skills. We need it to be GPU-accelerated so that these virtual worlds could live in super-linear time [to] train these models incredibly fast,” Huang says.

To work reliably around the home or outside controlled workplace environments, robots need higher-level policies that translate a semantic understanding of the command “fold that laundry” into a series of actions that lead to a pile of neatly folded clothes.

That could come from the same switch in strategies that led to reasoning models like OpenAI’s o1 and DeepSeek-R1. The idea is to have robots mimic that kind of reasoning for high-level skills and have low-level motion training put the results into action.
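
The split the article describes can be caricatured in a few lines. In the sketch below, the skill names, the planner and its rules are all invented for illustration; in real systems both levels are learned models, with the reasoning layer proposing skills and trained motion policies executing them.

```python
from typing import Callable

# Hypothetical motion-skill library: each low-level skill would be a trained
# controller; here each is a stub that reports what it did.
MOTION_SKILLS: dict[str, Callable[[], str]] = {
    "pick_up_garment": lambda: "garment grasped",
    "flatten_garment": lambda: "garment flattened",
    "fold_in_half":    lambda: "garment folded",
    "stack_on_pile":   lambda: "garment stacked",
}

def high_level_policy(command: str) -> list[str]:
    """Stand-in for a reasoning model: map a semantic command to a skill
    sequence. Real systems infer this; the rule below is hard-coded."""
    if "fold" in command and "laundry" in command:
        return ["pick_up_garment", "flatten_garment",
                "fold_in_half", "stack_on_pile"]
    raise ValueError(f"no plan for: {command}")

def execute(command: str) -> list[str]:
    # The low-level layer runs each proposed skill in order.
    return [MOTION_SKILLS[skill]() for skill in high_level_policy(command)]

log = execute("fold that laundry")
```

The interesting engineering lives in the seams: deciding when a skill has failed, and feeding that back to the reasoning layer, which is where simulated physics comes in.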

Though language models have delivered rapid advances in robot skills, researchers point to their tendency to hallucinate. Huang argues that the deeper physics of engines like Newton provides a more robust environment for assessing how well higher-level robot policies are likely to perform in the real world, as it gives immediate feedback on moves that cannot work. This work will feed later in the year into a foundation model called Cosmos Reason.

However, the experience of programmers with AI assistants has shown how language models can suddenly do the unexpected because their pretraining draws on disparate sources of data, not all of which are helpful. Robot training may require far more attention to data sources to prevent problems, perhaps working with more traditional algorithmic controllers as backups to constrain actions that the lower-level physics model cannot catch.

“It is important to have your eyes open about robotics,” says Goldberg. “We will get there one step at a time.”