Hanging multiple shirts in a row: https://twitter.com/ayzwah/status/1780263770440491073
Generalizing to unseen sweater: https://twitter.com/ayzwah/status/1780263771858194809
Struggling to unfold a shirt: https://twitter.com/DannyDriess/status/1780270239185588732
Assembling gears: https://twitter.com/ayzwah/status/1780263775213629497
Here's a shoe tying at 1x: https://twitter.com/ayzwah/status/1780263776694182311
Interesting here how it ties the knot. The first knot is already in place and they just do the bow. I don't think the way most people tie their shoes (the bunny-around-the-tree method[0]) would work well for a robot, but I actually tie my shoes like this[1], which is the same way the robot ties.
So I gotta ask: was this a purely learned policy, or was it taught or pushed in that direction[^]? I suspect the latter.
[0] https://www.youtube.com/watch?v=YwqQvKtmefE
[1] https://www.youtube.com/watch?v=XPIgR89jv3Q
[^] By "pushed in that direction" I also include watching that video or any other videos like it.
1. Are there cuts in the video? If so, the robot may not be able to perform the entire task by itself without help. UBTECH's video has a couple of cuts. Google's videos have none.
2. Is the video sped up? If so, the robot may be very slow. UBTECH's video is at 1x, which is good, but you can see that the robot moves somewhat slowly and does not switch fluidly between actions. Google posted both 1x and 2x-20x videos, so you can easily see both real-time speed and long-duration reliability. In the 1x videos Google's robot is also somewhat slow, but it seems to switch more fluidly between actions than UBTECH's.
3. Is the initial state at the start of the video realistic? If not, the robot may not be able to start the task without help. UBTECH's video starts with a carefully folded and perfectly flat shirt already in the hands of the robot. Google's videos start with shirts relatively messily placed on tables and somewhat crumpled.
4. Is the task repeated? If not, the robot may be very unreliable at finishing the task. Google's videos show a lot of repetition without cuts. UBTECH's video shows only one instance (with cuts). You could still produce this video even if the UBTECH robot fails 90% of the time.
5. Is there variation in the repeated tasks? If not, the robot may fail if there is any variation. Google shows different colors and initial states of shirts, and also a much larger sweater. That said, almost all the shirts are small polo shirts and the robot would certainly not generalize to anyone's real closet of varied clothes.
6. Does the robot react to mistakes or other unexpected events? If not, it may be mostly playback of pre-recorded motions with little or no sensing influencing the robot's behavior. UBTECH's video shows alleged sensing, but doesn't show any unexpected things or mistakes. Google's videos show the robot recovering from mistakes.
You can see at 1:05 how the man suddenly accelerates.
Like it can't pick it up with 2 hands?
Some other weird things in that video too
Would be interesting to know if the AI figured this out (the hard way, I'm sure).
That extra exchange could have been done after putting the shirt on the hanger, but that would have been more risk for it falling off again.
My fear is that we see a similar problem with other generative AI in that it gets stuck in loops on complex problems and is unable to correct itself because the training data covers the problem but not the failure modes.
When an AI is set up to learn from its own mistakes it might turn out like AlphaZero, which rediscovered the strategies of Go from scratch. LLMs are often incapable of solving complex tasks on their own, but they are greatly helped by evolutionary algorithms. If you combine LLMs with EAs you get black-box optimization plus intuition. It's all based on learning from the environment, interactivity and play. LLMs can provide the mutation operation, function as the judge that selects surviving agents, or act as the agents themselves.
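For concreteness, a minimal sketch of that LLM-plus-EA loop in Python. `llm_mutate` and `llm_score` are hypothetical stand-ins for calls to a language model acting as the mutation operator and as the judge; nothing here is a real library API.

```python
import random

def llm_mutate(candidate: str) -> str:
    """Placeholder: ask an LLM to propose a small variation of `candidate`."""
    return candidate + random.choice([" tweak-a", " tweak-b"])

def llm_score(candidate: str) -> float:
    """Placeholder: ask an LLM (or the environment) to rate `candidate`."""
    return random.random()

def evolve(seed: str, population_size: int = 8, generations: int = 5) -> str:
    population = [seed] * population_size
    for _ in range(generations):
        # Mutation: the LLM proposes a variation of each surviving candidate.
        offspring = [llm_mutate(c) for c in population]
        # Selection: keep the best-scoring half of parents plus offspring.
        ranked = sorted(population + offspring, key=llm_score, reverse=True)
        population = ranked[:population_size]
    return population[0]

print(evolve("initial plan for the task"))
```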
Technically, skin can't sense whether something is wet, and isn't particularly great at sensing temperature. Skin senses pressure and heat flow (derived from the temperature change of the flesh itself, rather than the temperature of the object it is touching), and perhaps shear (there is a unique sensation when skin is stretched or pulled apart), as well as the weight of an object (if it is absorbent and more wet than damp). Because of this gap between what skin directly senses and what we perceive, the brain can be deceived about wetness and temperature specifically.
Wetness is a perception derived from feeling higher-than-expected heat loss and unusual pressure/shear, and even from the sound made when squeezing an absorbent material or the sensation of water pooling around the finger (broadening the area of heat loss) when you squeeze into the material. Damp laundry at room temperature is perceived as obviously wet because it feels colder than it should if it were dry, but when we're pulling laundry out of a dryer we often can't tell if it's dry or still a bit damp -- the higher temperature of the object removes the sensation of heat flowing away from our fingers, so there's nothing our fingers can sense to tell us the clothes aren't dry until the clothes finally cool down to room temperature.
Our skin also doesn't sense the temperature of an object well if that object has a particularly high or low heat transfer coefficient of conduction. I recently bought a 6-pack of beer cans which have a moderately thick plastic vinyl label shrunk around the can. When I reach in my fridge, I can't convince myself to perceive it as chilled no matter how hard I try. Even though the vinyl is the same temperature as everything else in the fridge, it doesn't pull heat out of my finger tissue, so my brain cannot perceive that it isn't "room temperature". Conversely, picking up a normal metal can of beer that is just barely below room temperature, my brain perceives it to be much colder than it actually is because the metal draws heat away from my fingers so quickly compared to other objects. If wood is cooled 5 degrees below room temperature, it doesn't feel cold, but a can of beer certainly does!
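To put a rough number on the vinyl example (a back-of-the-envelope sketch; the symbols are just illustrative), the conductive heat flux out of the fingertip is roughly

    q ≈ (T_skin - T_object) / (t_vinyl / k_vinyl + R_contact)

where t_vinyl / k_vinyl is the thermal resistance of the label. A plastic label's low conductivity k makes that term large, so q stays small and the thermoreceptors (which respond to heat flow, i.e. how fast the skin itself cools) report "not cold" even though T_object is low. Bare metal adds almost no resistance, so q is large and the can feels colder than it really is.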
It is absolutely incredible that our skin can sense things at such high resolution that it seems like we have far more abilities than we actually do. It is also amazing how our brain integrates this into a rich perception. But there aren't actually many physical properties being measured, and this distinction matters for edge cases, some of which are quite common.
Ah the "Are the clothes in the dryer cold or are they wet" effect.
Shade aside, robotics is so damned hard.
The current over/under has godlike superintelligence arriving before a robot that can make sandwiches and work the laundry machine... So unintuitive.
If you had to work with two chopsticks, or two spanners, as hands, you would not do any better.
Opposing digits get you 90% of the way there, which is why, when you finally start using fingers independently (say, while playing the violin), a large amount of time is spent getting around the learnt 'synergies'.
It's unclear why we actually evolved fingers in the first place (balance?).
Two human fingers are much better grippers than what that machine has. Fingers can feel pressure, have multiple joints, and have soft pads.
The Aloha systems in this video are just as scripted as Boston Dynamics robots. The difference is in how the robots' behaviours are scripted: instead of hard-coding Finite State Controllers, the behaviours are programmed by demonstration, through teleoperation.
The end result is the same: the final system can do one thing, and do it well enough to put in a video, but it can't deal with changes to the environment or to the initial setup of the task.
For instance, in the shoe-lace tying part of the video, the shoe is already placed neatly between the two arms and only at a small angle away from vertical, the laces are pulled to the side, and of course the knot is already half-tied. If you changed those initial parameters by a few units, placed the shoe further towards the top or bottom of the table, had both laces on one side, or turned the shoe at a right angle to the arms, the system would fail.
Despite the "generalisation to sweater" video, there's very little flexiblity and very little generalisation, and that's still a system that can perform a handful of discrete tasks (swap gripper, tie shoelaces, hang shirt). That's not a system that can function autonomously in the real world.
No robot maids or android valets coming to a shop near you any time soon, I'm afraid. For the foreseeable future the most successful autonomous systems will remain self-guided munitions, which are made to be destroyed on impact and cause maximum damage.
In other words, how much would it set me back to recreate this?
The robot arm kits are available commercially: https://www.trossenrobotics.com/aloha-kits
It's my belief that 10x cheaper arms with the same performance are possible, and the only reason they don't exist is because nobody needs them in sufficient quantities.
I have a dream that we put self-replicating robots on Mars and let them build a mostly by-robots for-robots civilization that can potentially export stuff to earth, do various science projects and build spacecraft.
Paper: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation: https://arxiv.org/abs/2401.02117
Video set: https://mobile-aloha.github.io/
Tutorial: https://docs.google.com/document/d/1_3yhWjodSNNYlpxkRCPIlvIA...
Kits for sale: https://www.trossenrobotics.com/aloha-kits
* arms (aka follower arms)
- effector (i.e. gripper)
- sensors (cameras/depth sensors; the spec calls for Intel RealSense D405)
- gravity compensation (so the relatively delicate servos aren't overloaded)
* controller
- runs the Robot Operating System (ROS [1]) plus other software (e.g. arm and gripper interfaces [2])
- runs the ALOHA model in inference to tell ROS what to do based on the task and sensor input (see the sketch after this list)
- trains ALOHA models using the arm motion encoders and ACT: Action Chunking with Transformers [4]
* leader arms
- motion encoders (essentially an arm in reverse that a human moves to teleoperate the follower arm, encoding motions for model training)
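To make the controller's role concrete, here is a hypothetical sketch of what its inference loop might look like (camera frames in, a chunk of joint targets out). Class names like `ACTPolicy`, `Camera`, and `FollowerArm` are placeholders invented for illustration, not the actual Interbotix/ROS or ACT-repo interfaces.

```python
import time
import numpy as np

class Camera:
    def read(self) -> np.ndarray:
        return np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in RGB frame

class FollowerArm:
    def set_joint_targets(self, q: np.ndarray) -> None:
        pass  # would publish joint commands via the arm's ROS interface

class ACTPolicy:
    def predict_chunk(self, images: list, qpos: np.ndarray) -> np.ndarray:
        # Returns a chunk of future joint targets, shape (chunk_len, dof).
        return np.zeros((50, 14))

def control_loop(policy: ACTPolicy, cams: list, arms: list, hz: float = 50.0):
    qpos = np.zeros(14)  # assuming 7 joints per arm, two arms
    while True:
        images = [c.read() for c in cams]
        chunk = policy.predict_chunk(images, qpos)
        for q in chunk:              # execute the action chunk open-loop...
            left, right = q[:7], q[7:]
            arms[0].set_joint_targets(left)
            arms[1].set_joint_targets(right)
            time.sleep(1.0 / hz)     # ...then re-query the policy
```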
The system at this point is "research grade", which is at once expensive (due to custom/nice materials and units) and not super user friendly -- you must know a lot. See the build instructions [5].
0. https://github.com/evildmp/BrachioGraph
2. https://github.com/interbotix
3. https://www.trossenrobotics.com/aloha-kits
4. https://github.com/tonyzhaozh/act
5. https://docs.google.com/document/d/1sgRZmpS7HMcZTPfGy3kAxDrq...
https://chicago.eater.com/2018/7/31/17634686/aloha-poke-co-c...
> the Chicago-born restaurant chain whose attorneys sent cease and desist messages to poke shop owners in Hawai’i, Alaska, and Washington state demanding they change names by dropping the terms “aloha” and “poke” when used together. While Aloha Poke contends it sent notes in a “cooperative manner” to defend intellectual property, Native Hawaiians feel the poke chain is trying to restrict how they can embrace their own heritage.
The other thing is that words have a lot of power in the cultural frame, even just the concept of aloha being something that could be "unleashed" is likely to offend.
All to say nothing of the palpable fear people have here of robots taking hospitality-industry jobs like housekeeping (which is unionized in many hotels out here, and is actually one of the few low-barrier-to-entry jobs that can support a reasonable quality of life).
I'm sure I'll get a ton of downvotes for bringing up cultural sensitivity and pointing out these concerns -- I don't mean to imply they're all 100% rational nor that no one should say "aloha" unless they're Hawaiian, but if anyone at DeepMind had a Hawaiian cultural frame I think they likely would have flagged these concerns and recommended a different name.
Which is such a shame, as Univ of Hawaii was one of the pioneers of the Internet: https://en.wikipedia.org/wiki/ALOHAnet
Though to be fair, those laces are really long. The robot needs to unlace the shoes, cut some length from the middle, tie a double fisherman's knot, and relace them.
And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:
I have to disagree here. Not for 20k, but if you could really build a robot arm out of basically a desk lamp, some servos, and a camera, and had some software to control it as precisely as this video claims, that would be a complete game changer. We'd probably see an explosion of attempts to automate all kinds of everyday household tasks that are infeasible to automate cost-effectively today (folding laundry, cleaning up the room, cooking, etc.).
Also, every self-respecting maker out there would probably try to build one :)
> And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:
I don't quite understand how the video fits here. That's a manually operated robot arm. The point of Aloha is that it's fully controlled by software, right?
We're still very far from that and you certainly can't do that with ALOHA, in practice, despite what the videos may seem to show. For each of the few, discrete, tasks that you see in the videos, the robot arms have to be trained by demonstration (via teleoperation) and the end result is a system that can only copy the operator's actions with very little variation.
You can check this in the Mobile ALOHA paper on arxiv (https://arxiv.org/abs/2401.02117) where page 6 shows the six tasks the system has been trained to perform, and the tolerances in the initial setup. So e.g. in the shrimp cooking task, the initial position of the robot can vary by 10cm and the position of the implements by 2cm. If everything is not set up just so, the task will fail.
What all this means is that if you could assemble this "cheap" system you'd then have to train it by a few hundred demonstrations to fold your laundry, and maybe it could do it, probably not, and if you moved the washing machine or got a new one, you'd have to train all over again.
As to robots cleaning up your room and cooking, those are currently in the realm of science fiction, unless you're a zen ascetic living in an empty room and happy to eat beans on toast every day. Beans from a can, that is. You'll have to initialise the task by opening the can yourself, obviously. You have a toaster, right?
Yes, that's my point. Cheap hardware is far harder to control than expensive hardware, so if Google actually developed some AI that can do high-precision tasks on "wobbly", off-the-shelf hardware, that would be the breakthrough.
I agree that extensive training for each individual device would be prohibitive, but that feels like a problem that could be solved with more development: with many machine learning tasks, we started by training an individual model for each specific use case and environment. Today we're able to make generalized models which are trained once and can be deployed in a wide variety of environments. I don't see why this shouldn't be possible for a vision-based robot controller either.
Managing the actual high-level task is easy once you're able to do all the low-level tasks: converting a recipe into a machine-readable format, dividing it into a tree of tasks and subtasks, etc. is easy (see the toy sketch below). The hard parts are actually cutting the vegetables, de-boning the meat, and so on. The amount of complex motion planning necessary for that doesn't exist yet. But this project looks like a step in exactly that direction.
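A purely illustrative toy sketch of that point: the task tree itself is a few lines of code, while every leaf hides a hard manipulation problem.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    subtasks: List["Task"] = field(default_factory=list)

    def leaves(self) -> List["Task"]:
        if not self.subtasks:
            return [self]
        return [leaf for t in self.subtasks for leaf in t.leaves()]

stir_fry = Task("make stir-fry", [
    Task("prepare ingredients", [
        Task("wash vegetables"),
        Task("cut vegetables"),      # easy to write down, hard to actually do
        Task("de-bone the meat"),
    ]),
    Task("cook", [
        Task("heat the pan"),
        Task("add ingredients in order"),
        Task("stir until done"),
    ]),
])

print([leaf.name for leaf in stir_fry.leaves()])
```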
Obviously some amount of generalization is required to fold a shirt, as no two shirts will ever be in precisely the same configuration after being dropped on a table by a human. Playback of recorded motor signals could never solve this task.
It's interesting that they are using "leader arms" [0] to encode tasks instead of motion capture. Is it just a matter of reduced complexity to get off the ground? I suppose the task of mapping human arm motion to what a robot can do is tough.
Note for example that all the shirts in the videos are oriented in the same direction, with the neck facing to the top of the video. Even then, the system can only straighten a shirt that lands with one corner folded under it after many failed attempts, and if you turned a shirt so that the neck faced downwards, it wouldn't be able to straighten it and hang it no matter how many times it tried. Let's not even talk about getting a shirt tangled in the arms themselves (in the videos, a human intervenes to free the shirt and start again). It's trained to straighten a shirt on the table, with the neck facing one way [1].
So the OP is very right. We're no nearer to real-world autonomy than we were in the '50s. The behaviours of the systems you see in the videos are still hard-coded, only they're hard-coded by demonstration, with extremely low tolerance for variation in tasks or environments, and they still can't do anything they haven't been painstakingly and explicitly shown how to do. This is a severe limitation, and without a clear solution to it there's no autonomy.
On the other hand, ιδού πεδίον δόξης λαμπρόν ("behold, a splendid field of glory"), as we say in Greek. This is a wide open field full of hills to plant one's flag on. There's so much that robotic autonomy can't yet do that you can get Google to fund you if you can show a robot tying half a knot.
__________________
[1] Note btw that straightening the shirt is pointless: it will straighten up when you hang it. That's just to show the robot can do some random moves and arrive at a result that maybe looks meaningful to a human, but there's no way to tell whether the robot is sensing that it achieved a goal, or not. The straightening part is just a gimmick.
It is true that replay in the world frame will not handle changes to the shirt's initial position. But if the commands are in the frame of the end-effector and the data is object-centric, replay will somewhat generalize. (Please also consider the fact that you are watching the videos that have survived the "should I upload this?" filter.)
The second thing is that large-scale behavior cloning (which is the technique used here) is essentially replay with a little smoothing. Not bad inherently, but just a fact.
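To illustrate the object-centric replay point in 2-D (a toy sketch with made-up numbers, not anything from the paper): store the demo trajectory relative to the object's pose at demo time, then re-anchor it to the object's pose at test time.

```python
import numpy as np

def pose_matrix(x: float, y: float, theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0,  0, 1]])

def hom(p: np.ndarray) -> np.ndarray:
    return np.append(p, 1.0)  # homogeneous coordinates

# Gripper waypoints recorded in the world frame during the demonstration.
demo_waypoints = np.array([[0.50, 0.20], [0.52, 0.22], [0.55, 0.25]])
demo_object_pose = pose_matrix(0.50, 0.20, 0.0)

# Express the demo relative to the object (object-centric representation).
relative = [np.linalg.inv(demo_object_pose) @ hom(p) for p in demo_waypoints]

# At test time the object sits somewhere else; re-anchor the trajectory.
test_object_pose = pose_matrix(0.30, 0.45, np.deg2rad(15))
replayed = np.array([(test_object_pose @ r)[:2] for r in relative])
print(replayed)  # world-frame waypoints adapted to the new object pose
```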
My point is that there was an academic contribution back when the first ALOHA paper came out, showing that BC on low-quality hardware could work, but this is like the 4th paper in a row of more or less the same stuff.
Since this is YC, I'll add - As an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking. Less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.
A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:
One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.
Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4 scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dextrous robot in the field we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP before.
That's not something that you can solve with learning from data, alone. A real-world autonomous system must be able to deal with situations that it has no experience with, it has to be able to deal with them as they unfold, and it has to learn from them general strategies that it can apply to more novel situations. That is a problem that, by definition, cannot be solved by any approach that must be trained offline on many examples of specific situations.
Another limiting factor is that data collection is a big problem: not only will you never be sure you've collected enough data, but they're also collecting data of a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is of a human working poorly, which isn't a great source of data! Furthermore, limiting the data collection to (typically) 10Hz means that the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually understand velocity as a 'sufficient statistic' of the underlying dynamics.
Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop/BC. It's going to hamper the folks actually pursuing first-principles thinking.
>> It's going to hamper the folks actually pursuing first-principles thinking.
Nah.
ALOHA was not new, but it's still good work because robotics researchers were not focused on this form of data collection. The issue was that most people went down the simulation rabbit hole, where they had to solve sim-to-real.
Others went for the VR headset and hand-tracking idea, where you never got super precise manipulation, so any robots trained on that always showed choppy movement.
Others, including OpenAI, decided to go full reinforcement learning, forgoing human demonstrations. That had some decent results, but after 6 months of RL on an arm farm led by Google and Sergey Levine, the results were underwhelming to say the least.
So yes, it's not like ALOHA invented teleoperation, but they demonstrated that using this mode of teleoperation you can collect a lot of data to train autonomous robot policies easily and beat other methods, which I think is a great contribution!
[0] Unlike the people who downvoted me for asking a question.
In any case, the real star of this show is clearly the shirt hanging.