Researchers have introduced UniT, a novel framework that bridges the gap between human motion data and humanoid robot learning. By using a Unified Latent Action Tokenizer, UniT maps diverse human and humanoid actions into shared latent physical intent tokens, enabling efficient cross-embodiment transfer.
The framework comprises two main components: VLA-UniT for humanoid policy learning and WM-UniT for world modeling. This dual design enables zero-shot humanoid task learning and controllable humanoid video generation while leveraging abundant human motion data.
Key features include:
- Unified representation: Heterogeneous actions are tokenized into a common latent space.
- Cross-embodiment transfer: Learned policies can be transferred from human data to humanoid robots.
- Scalability: The framework efficiently utilizes large-scale human motion datasets.
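To make the tokenization idea concrete, here is a minimal sketch of how a unified latent action tokenizer might work. This is an illustrative assumption, not the paper's actual architecture: it uses embodiment-specific linear encoders into a shared latent space and a shared codebook that discretizes latents into token ids (a vector-quantization-style scheme). All dimensions, names, and the nearest-code lookup are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a unified action tokenizer (not UniT's actual code).
# Each embodiment gets its own encoder into a SHARED latent space; a shared
# codebook then discretizes the latent into a "physical intent" token id.

rng = np.random.default_rng(0)

LATENT_DIM = 16     # assumed shared latent dimensionality
CODEBOOK_SIZE = 64  # assumed shared token vocabulary size

# Shared codebook of latent intent tokens (randomly initialized here).
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

# Embodiment-specific encoders: raw action dim -> shared latent dim.
# Action dimensions below are illustrative placeholders.
encoders = {
    "human": rng.normal(size=(51, LATENT_DIM)),     # e.g. human pose features
    "humanoid": rng.normal(size=(29, LATENT_DIM)),  # e.g. humanoid joint targets
}

def tokenize(embodiment: str, action: np.ndarray) -> int:
    """Map a raw action to the index of its nearest shared latent token."""
    z = action @ encoders[embodiment]             # project into shared latent space
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to every codebook entry
    return int(np.argmin(dists))                  # nearest-code token id

# Actions from different embodiments land in the SAME token vocabulary,
# so downstream policy and world models can consume them uniformly.
human_token = tokenize("human", rng.normal(size=51))
humanoid_token = tokenize("humanoid", rng.normal(size=29))
print(human_token, humanoid_token)
```

Because both embodiments emit ids from one vocabulary, a policy trained on human-derived token sequences can, in principle, be reused for humanoid control, which is the intuition behind the cross-embodiment transfer described above.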
UniT represents a significant step toward creating a universal "physical language" that can be shared between humans and humanoid robots, potentially accelerating the development of more capable and adaptable humanoid systems.