Hanging multiple shirts in a row: https://twitter.com/ayzwah/status/1780263770440491073
Generalizing to unseen sweater: https://twitter.com/ayzwah/status/1780263771858194809
Struggling to unfold a shirt: https://twitter.com/DannyDriess/status/1780270239185588732
Assembling gears: https://twitter.com/ayzwah/status/1780263775213629497
Here's a shoe tying at 1x: https://twitter.com/ayzwah/status/1780263776694182311
Interesting here how it ties the knot. The first knot is already in place and they just do the bow. I don't think the way most people tie their shoes (the bunny-around-the-tree method[0]) would work well for a robot, but I actually tie my shoes like this[1], which is the same way the robot ties.
So I gotta ask: was this a purely learned policy, or was it taught or pushed in that direction[^]? I suspect the latter.
[0] https://www.youtube.com/watch?v=YwqQvKtmefE
[1] https://www.youtube.com/watch?v=XPIgR89jv3Q
[^] By "pushed in that direction" I also include watching that video or any other videos like it.
1. Are there cuts in the video? If so, the robot may not be able to perform the entire task by itself without help. UBTECH's video has a couple of cuts. Google's videos have none.
2. Is the video sped up? If so, the robot may be very slow. UBTECH's video is at 1x, which is good, but you can see that the robot moves somewhat slowly and does not switch fluidly between actions. Google posted both 1x and 2x-20x videos, so you can easily see both real-time speed and long-duration reliability. In the 1x videos Google's robot is also somewhat slow, but it seems to switch more fluidly between actions than UBTECH's.
3. Is the initial state at the start of the video realistic? If not, the robot may not be able to start the task without help. UBTECH's video starts with a carefully folded and perfectly flat shirt already in the hands of the robot. Google's videos start with shirts relatively messily placed on tables and somewhat crumpled.
4. Is the task repeated? If not, the robot may be very unreliable at finishing the task. Google's videos show a lot of repetition without cuts. UBTECH's video shows only one instance (with cuts). You could still produce this video even if the UBTECH robot fails 90% of the time.
5. Is there variation in the repeated tasks? If not, the robot may fail if there is any variation. Google shows different colors and initial states of shirts, and also a much larger sweater. That said, almost all the shirts are small polo shirts and the robot would certainly not generalize to anyone's real closet of varied clothes.
6. Does the robot react to mistakes or other unexpected events? If not, it may be mostly playback of pre-recorded motions with little or no sensing influencing the robot's behavior. UBTECH's video shows alleged sensing, but doesn't show any unexpected things or mistakes. Google's videos show the robot recovering from mistakes.
You can see at 1:05 how the man suddenly accelerates.
Like it can't pick it up with 2 hands?
Some other weird things in that video too
Would be interesting to know if the AI figured this out (the hard way, I'm sure).
That extra exchange could have been done after putting the shirt on the hanger, but that would have been more risk for it falling off again.
My fear is that we see a similar problem with other generative AI in that it gets stuck in loops on complex problems and is unable to correct itself because the training data covers the problem but not the failure modes.
When an AI is set up to learn from its own mistakes it might turn out like AlphaZero, which rediscovered the strategies of Go from scratch. LLMs are often incapable of solving complex tasks on their own, but they are greatly helped by evolutionary algorithms. If you combine LLMs with EAs you get black-box optimization plus intuition. It's all based on learning from the environment, interactivity and play. LLMs can provide the mutation operation, function as the judge that selects surviving agents, or act as the agents themselves.
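For concreteness, a minimal sketch of that LLM-plus-EA loop in Python. `llm_mutate` and `llm_score` are hypothetical stand-ins for calls to a language model acting as the mutation operator and as the judge; nothing here is a real library API.

```python
import random

def llm_mutate(candidate: str) -> str:
    """Placeholder: ask an LLM to propose a small variation of `candidate`."""
    return candidate + random.choice([" tweak-a", " tweak-b"])

def llm_score(candidate: str) -> float:
    """Placeholder: ask an LLM (or the environment) to rate `candidate`."""
    return random.random()

def evolve(seed: str, population_size: int = 8, generations: int = 5) -> str:
    population = [seed] * population_size
    for _ in range(generations):
        # Mutation: the LLM proposes a variation of each surviving candidate.
        offspring = [llm_mutate(c) for c in population]
        # Selection: keep the best-scoring half of parents plus offspring.
        ranked = sorted(population + offspring, key=llm_score, reverse=True)
        population = ranked[:population_size]
    return population[0]

print(evolve("initial plan for the task"))
```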
Technically, skin can't sense whether something is wet, and isn't particularly great at sensing temperature. Skin senses pressure and heat flow (derived from the temperature change of the flesh itself, rather than the temperature of the object it is touching), and perhaps shear (there is a unique sensation when skin is stretched or pulled apart), as well as the weight of an object (if it is absorbent and more wet than damp). Because of this gap between what skin directly senses and what we perceive, the brain can be deceived about wetness and temperature specifically.
Wetness is a perception derived from feeling higher-than-expected heat loss and unusual pressure/shear, and even from the sound made when squeezing an absorbent material or the sensation of water pooling around the finger (broadening the area of heat loss) when you squeeze into the material. Damp laundry at room temperature is perceived as obviously wet because it feels colder than it should if it were dry, but when we're pulling laundry out of a dryer we often can't tell if it's dry or still a bit damp -- the higher temperature of the object removes the sensation of heat flowing away from our fingers, so there's nothing our fingers can sense to tell us the clothes aren't dry until the clothes finally cool down to room temperature.
Our skin also doesn't sense the temperature of an object well if that object has a particularly high or low heat transfer coefficient of conduction. I recently bought a 6-pack of beer cans which have a moderately thick plastic vinyl label shrunk around the can. When I reach in my fridge, I can't convince myself to perceive it as chilled no matter how hard I try. Even though the vinyl is the same temperature as everything else in the fridge, it doesn't pull heat out of my finger tissue, so my brain cannot perceive that it isn't "room temperature". Conversely, picking up a normal metal can of beer that is just barely below room temperature, my brain perceives it to be much colder than it actually is because the metal draws heat away from my fingers so quickly compared to other objects. If wood is cooled 5 degrees below room temperature, it doesn't feel cold, but a can of beer certainly does!
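To put a rough number on the vinyl example (a back-of-the-envelope sketch; the symbols are just illustrative), the conductive heat flux out of the fingertip is roughly

    q ≈ (T_skin - T_object) / (t_vinyl / k_vinyl + R_contact)

where t_vinyl / k_vinyl is the thermal resistance of the label. A plastic label's low conductivity k makes that term large, so q stays small and the thermoreceptors (which respond to heat flow, i.e. how fast the skin itself cools) report "not cold" even though T_object is low. Bare metal adds almost no resistance, so q is large and the can feels colder than it really is.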
It is absolutely incredible that our skin can sense things at such high resolution that it seems like we have far more abilities than we actually do. It is also amazing how our brain integrates this into a rich perception. But there aren't actually many physical properties being measured, and this distinction matters for edge cases, some of which are quite common.
Ah the "Are the clothes in the dryer cold or are they wet" effect.
Shade aside, robotics is so damned hard.
The current over/under has godlike superintelligence arriving before a robot that can make sandwiches and work the laundry machine... So unintuitive.
If you had to work with two chopsticks, or two spanners, as hands, you would not do any better.
Opposing digits get you 90% of the way there, which is why, when you finally start using fingers independently (say, while playing the violin), a large amount of time is spent getting around the learnt 'synergies'.
It's unclear why we actually evolved fingers in the first place (balance?).
Two human fingers are much better grippers than what that machine has. Fingers can feel pressure, have multiple joints, and have soft pads.
The Aloha systems in this video are just as scripted as Boston Dynamics robots. The difference is in how the robots' behaviours are scripted: instead of hard-coding Finite State Controllers, the behaviours are programmed by demonstration, through teleoperation.
The end result is the same: the final system can do one thing, and do it well enough to put in a video, but it can't deal with changes to the environment or to the initial setup of the task.
For instance, in the shoe-lace tying part of the video, the shoe is already placed neatly between the two arms and only at a small angle away from vertical, the laces are pulled to the side, and of course the knot is already half-tied. If you changed those initial parameters by a few units, placed the shoe further towards the top or bottom of the table, had both laces on one side, or turned the shoe at a right angle to the arms, the system would fail.
Despite the "generalisation to sweater" video, there's very little flexiblity and very little generalisation, and that's still a system that can perform a handful of discrete tasks (swap gripper, tie shoelaces, hang shirt). That's not a system that can function autonomously in the real world.
No robot maids or android valets coming to a shop near you any time soon, I'm afraid. For the foreseeable future the most successful autonomous systems will remain self-guided munitions, which are made to be destroyed on impact and cause maximum damage.
In other words, how much would it set me back to recreate this?
The robot arm kits are available commercially: https://www.trossenrobotics.com/aloha-kits
It's my belief that 10x cheaper arms with the same performance are possible, and the only reason they don't exist is because nobody needs them in sufficient quantities.
I have a dream that we put self-replicating robots on Mars and let them build a mostly by-robots for-robots civilization that can potentially export stuff to earth, do various science projects and build spacecraft.
Paper: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation: https://arxiv.org/abs/2401.02117
Video set: https://mobile-aloha.github.io/
Tutorial: https://docs.google.com/document/d/1_3yhWjodSNNYlpxkRCPIlvIA...
Kits for sale: https://www.trossenrobotics.com/aloha-kits
* arms (aka follower arms)
- effector (i.e. gripper)
- sensors (cameras/depth sensors; the spec calls for Intel RealSense D405)
- gravity compensation (so the relatively delicate servos aren't overloaded)
* controller
- runs the Robot Operating System (ROS [1]) plus other software (e.g. arm and gripper interfaces [2])
- runs the ALOHA model in inference to tell ROS what to do based on the task and sensor input (see the sketch after this list)
- trains ALOHA models using the arm motion encoders and ACT: Action Chunking with Transformers [4]
* leader arms
- motion encoders (essentially an arm in reverse that a human moves to teleoperate the follower arm, encoding motions for model training)
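To make the controller's role concrete, here is a hypothetical sketch of what its inference loop might look like (camera frames in, a chunk of joint targets out). Class names like `ACTPolicy`, `Camera`, and `FollowerArm` are placeholders invented for illustration, not the actual Interbotix/ROS or ACT-repo interfaces.

```python
import time
import numpy as np

class Camera:
    def read(self) -> np.ndarray:
        return np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in RGB frame

class FollowerArm:
    def set_joint_targets(self, q: np.ndarray) -> None:
        pass  # would publish joint commands via the arm's ROS interface

class ACTPolicy:
    def predict_chunk(self, images: list, qpos: np.ndarray) -> np.ndarray:
        # Returns a chunk of future joint targets, shape (chunk_len, dof).
        return np.zeros((50, 14))

def control_loop(policy: ACTPolicy, cams: list, arms: list, hz: float = 50.0):
    qpos = np.zeros(14)  # assuming 7 joints per arm, two arms
    while True:
        images = [c.read() for c in cams]
        chunk = policy.predict_chunk(images, qpos)
        for q in chunk:              # execute the action chunk open-loop...
            left, right = q[:7], q[7:]
            arms[0].set_joint_targets(left)
            arms[1].set_joint_targets(right)
            time.sleep(1.0 / hz)     # ...then re-query the policy
```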
The system at this point is "research grade", which is at once expensive (due to custom/nice materials and units) and not super user friendly -- you must know a lot. See the build instructions [5].
0. https://github.com/evildmp/BrachioGraph
2. https://github.com/interbotix
3. https://www.trossenrobotics.com/aloha-kits
4. https://github.com/tonyzhaozh/act
5. https://docs.google.com/document/d/1sgRZmpS7HMcZTPfGy3kAxDrq...
https://chicago.eater.com/2018/7/31/17634686/aloha-poke-co-c...
> the Chicago-born restaurant chain whose attorneys sent cease and desist messages to poke shop owners in Hawai’i, Alaska, and Washington state demanding they change names by dropping the terms “aloha” and “poke” when used together. While Aloha Poke contends it sent notes in a “cooperative manner” to defend intellectual property, Native Hawaiians feel the poke chain is trying to restrict how they can embrace their own heritage.
The other thing is that words have a lot of power in the cultural frame, even just the concept of aloha being something that could be "unleashed" is likely to offend.
All to say nothing of the palpable fear people have here of robots taking hospitality-industry jobs like housekeeping (which is unionized in many hotels out here, and is actually one of the few low-barrier-to-entry jobs that can support a reasonable quality of life).
I'm sure I'll get a ton of downvotes for bringing up cultural sensitivity and pointing out these concerns -- I don't mean to imply they're all 100% rational nor that no one should say "aloha" unless they're Hawaiian, but if anyone at DeepMind had a Hawaiian cultural frame I think they likely would have flagged these concerns and recommended a different name.
Which is such a shame, as Univ of Hawaii was one of the pioneers of the Internet: https://en.wikipedia.org/wiki/ALOHAnet
Though to be fair, those laces are really long. The robot needs to unlace the shoes, cut some length from the middle, tie a double fisherman's knot, and relace them.
And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:
I have to disagree here. Not for 20k, but if you could really build a robot arm out of basically a desk lamp, some servos, and a camera, and had some software to control it as precisely as this video claims, that would be a complete game changer. We'd probably see an explosion of attempts to automate all kinds of everyday household tasks that are infeasible to automate cost-effectively today (folding laundry, cleaning up the room, cooking, etc.).
Also, every self-respecting maker out there would probably try to build one :)
> And if we're focusing on the idea, it has existed since the 1950s and they were doing it relatively well then:
I don't quite understand how the video fits here. That's a manually operated robot arm. The point of Aloha is that it's fully controlled by software, right?
We're still very far from that and you certainly can't do that with ALOHA, in practice, despite what the videos may seem to show. For each of the few, discrete, tasks that you see in the videos, the robot arms have to be trained by demonstration (via teleoperation) and the end result is a system that can only copy the operator's actions with very little variation.
You can check this in the Mobile ALOHA paper on arxiv (https://arxiv.org/abs/2401.02117) where page 6 shows the six tasks the system has been trained to perform, and the tolerances in the initial setup. So e.g. in the shrimp cooking task, the initial position of the robot can vary by 10cm and the position of the implements by 2cm. If everything is not set up just so, the task will fail.
What all this means is that if you could assemble this "cheap" system you'd then have to train it by a few hundred demonstrations to fold your laundry, and maybe it could do it, probably not, and if you moved the washing machine or got a new one, you'd have to train all over again.
As to robots cleaning up your room and cooking, those are currently in the realm of science fiction, unless you're a zen ascetic living in an empty room and happy to eat beans on toast every day. Beans from a can, that is. You'll have to initialise the task by opening the can yourself, obviously. You have a toaster, right?
Yes, that's my point. Cheap hardware is far harder to control than expensive hardware, so if Google actually developed some AI that can do high-precision tasks on "wobbly", off-the-shelf hardware, that would be the breakthrough.
I agree that extensive training for each individual device would be prohibitive, but that feels like a problem that could be solved with more development: with many machine learning tasks, we started by training an individual model for each specific use case and environment. Today we're able to make generalized models which are trained once and can be deployed in a wide variety of environments. I don't see why this shouldn't be possible for a vision-based robot controller either.
Managing the actual high-level task is easy once you're able to do all the low-level tasks: converting a recipe into a machine-readable format, dividing it into a tree of tasks and subtasks, etc. is easy (see the toy sketch below). The hard parts are actually cutting the vegetables, de-boning the meat, and so on. The amount of complex motion planning necessary for that doesn't exist yet. But this project looks like a step in exactly that direction.
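A purely illustrative toy sketch of that point: the task tree itself is a few lines of code, while every leaf hides a hard manipulation problem.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    subtasks: List["Task"] = field(default_factory=list)

    def leaves(self) -> List["Task"]:
        if not self.subtasks:
            return [self]
        return [leaf for t in self.subtasks for leaf in t.leaves()]

stir_fry = Task("make stir-fry", [
    Task("prepare ingredients", [
        Task("wash vegetables"),
        Task("cut vegetables"),      # easy to write down, hard to actually do
        Task("de-bone the meat"),
    ]),
    Task("cook", [
        Task("heat the pan"),
        Task("add ingredients in order"),
        Task("stir until done"),
    ]),
])

print([leaf.name for leaf in stir_fry.leaves()])
```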
Obviously some amount of generalization is required to fold a shirt, as no two shirts will ever be in precisely the same configuration after being dropped on a table by a human. Playback of recorded motor signals could never solve this task.
It's interesting that they are using "leader arms" [0] to encode tasks instead of motion capture. Is it just a matter of reduced complexity to get off the ground? I suppose the task of mapping human arm motion to what a robot can do is tough.
Note for example that all the shirts in the videos are oriented in the same direction, with the neck facing to the top of the video. Even then, the system can only straighten a shirt that lands with one corner folded under it after many failed attempts, and if you turned a shirt so that the neck faced downwards, it wouldn't be able to straighten it and hang it no matter how many times it tried. Let's not even talk about getting a shirt tangled in the arms themselves (in the videos, a human intervenes to free the shirt and start again). It's trained to straighten a shirt on the table, with the neck facing one way [1].
So the OP is very right. We're no nearer to real-world autonomy than we were in the '50s. The behaviours of the systems you see in the videos are still hard-coded, only they're hard-coded by demonstration, with extremely low tolerance for variation in tasks or environments, and they still can't do anything they haven't been painstakingly and explicitly shown how to do. This is a severe limitation, and without a clear solution to it there's no autonomy.
On the other hand, ιδού πεδίον δόξης λαμπρόν ("behold, a splendid field of glory"), as we say in Greek. This is a wide open field full of hills to plant one's flag on. There's so much that robotic autonomy can't yet do that you can get Google to fund you if you can show a robot tying half a knot.
__________________
[1] Note btw that straightening the shirt is pointless: it will straighten up when you hang it. That's just to show the robot can do some random moves and arrive at a result that maybe looks meaningful to a human, but there's no way to tell whether the robot is sensing that it achieved a goal, or not. The straightening part is just a gimmick.
It is true that replay in the world frame will not handle changes to the shirt's initial position. But if the commands are in the frame of the end-effector and the data is object-centric, replay will somewhat generalize. (Please also consider the fact that you are watching the videos that have survived the "should I upload this?" filter.)
The second thing is that large-scale behavior cloning (which is the technique used here) is essentially replay with a little smoothing. Not bad inherently, but just a fact.
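To illustrate the object-centric replay point in 2-D (a toy sketch with made-up numbers, not anything from the paper): store the demo trajectory relative to the object's pose at demo time, then re-anchor it to the object's pose at test time.

```python
import numpy as np

def pose_matrix(x: float, y: float, theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0,  0, 1]])

def hom(p: np.ndarray) -> np.ndarray:
    return np.append(p, 1.0)  # homogeneous coordinates

# Gripper waypoints recorded in the world frame during the demonstration.
demo_waypoints = np.array([[0.50, 0.20], [0.52, 0.22], [0.55, 0.25]])
demo_object_pose = pose_matrix(0.50, 0.20, 0.0)

# Express the demo relative to the object (object-centric representation).
relative = [np.linalg.inv(demo_object_pose) @ hom(p) for p in demo_waypoints]

# At test time the object sits somewhere else; re-anchor the trajectory.
test_object_pose = pose_matrix(0.30, 0.45, np.deg2rad(15))
replayed = np.array([(test_object_pose @ r)[:2] for r in relative])
print(replayed)  # world-frame waypoints adapted to the new object pose
```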
My point is that there was an academic contribution back when the first ALOHA paper came out, showing that BC on low-quality hardware could work, but this is like the 4th paper in a row of more or less the same stuff.
Since this is YC, I'll add - As an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking. Less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.
A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:
One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.
Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4 scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dextrous robot in the field we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP before.
That's not something that you can solve with learning from data, alone. A real-world autonomous system must be able to deal with situations that it has no experience with, it has to be able to deal with them as they unfold, and it has to learn from them general strategies that it can apply to more novel situations. That is a problem that, by definition, cannot be solved by any approach that must be trained offline on many examples of specific situations.
Another limiting factor is that data collection is a big problem: not only will you never be sure you've collected enough data, but they're also collecting data of a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is of a human working poorly, which isn't a great source of data! Furthermore, limiting the data collection to (typically) 10Hz means that the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually understand velocity as a 'sufficient statistic' of the underlying dynamics.
Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop/BC. It's going to hamper the folks actually pursuing first-principles thinking.
>> It's going to hamper the folks actually pursuing first-principles thinking.
Nah.
ALOHA was not new, but it's still good work because robotics researchers were not focused on this form of data collection. The issue was that most people went down the simulation rabbit hole, where they had to solve sim-to-real.
Others went for the VR headset and hand-tracking idea, where you never got super precise manipulation, so any robots trained on that always showed choppy movement.
Others, including OpenAI, decided to go full reinforcement learning, forgoing human demonstrations. That had some decent results, but after 6 months of RL on an arm farm led by Google and Sergey Levine, the results were underwhelming to say the least.
So yes, it's not like ALOHA invented teleoperation, but they demonstrated that using this mode of teleoperation you can collect a lot of data to train autonomous robot policies easily and beat other methods, which I think is a great contribution!
[0] Unlike the people who downvoted me for asking a question.
In any case, the real star of this show is clearly the shirt hanging.