Robotic Foundation Models
How data helps future humanoid robots be reborn
Robotic Foundation Models (RFMs) [1] represent a new paradigm in embodied AI—models designed to generalize across tasks, sensor modalities, and robot embodiments. Similar to how foundation models like GPT have transformed natural language understanding, RFMs aim to serve as a universal intelligence layer for robots, enabling them to perceive, plan, and act across diverse real-world environments.
RFMs are essential to the future of AGI robots—humanoid agents that can adapt to new situations, assist in dynamic tasks, and collaborate fluidly with humans. However, building such models requires an unprecedented scale and diversity of data, as well as deployment infrastructure to continuously collect real-world feedback.
RFMs span a wide range of embodied-intelligence tasks. At Reborn, we organize these into several key research and application domains, each corresponding to a family of models developed or co-developed within our ecosystem:
Vision-Language-Action (VLA) Models: link visual perception, natural-language instructions, and motor commands for high-level task planning. Example: OpenVLA [3], which enables robots to execute commands like “pick up the red mug and place it on the table.” (A minimal sketch of this interface follows the list.)
Whole-Body Locomotion and Coordination: models that control full-body movement for walking, standing, navigating, and balancing across terrains. Example: an imitation-learning framework based on an adversarial motion prior [4].
Dexterous Grasping and In-Hand Manipulation: fine motor control for object handling, including finger-level adaptation and object reorientation.
Embodied Task Planning and Execution: end-to-end models for multi-step task planning in 3D environments, integrating memory and perception. Example: SayCan [5], which grounds language in robotic affordances.
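To make the VLA interface concrete, here is a minimal sketch of the perception-to-action loop such models expose. The names (`Observation`, `VLAPolicy`, `predict_action`) and the 7-dimensional action are illustrative assumptions for exposition, not OpenVLA's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """A single multimodal observation fed to a VLA policy (illustrative)."""
    rgb: np.ndarray      # (H, W, 3) camera image
    instruction: str     # natural-language command
    proprio: np.ndarray  # joint positions / gripper state

class VLAPolicy:
    """Illustrative wrapper around a pretrained vision-language-action model.

    A real system would load a checkpoint (e.g., an OpenVLA-style model);
    the names and shapes here are assumptions, not a specific API.
    """
    def __init__(self, action_dim: int = 7):
        # e.g., 6-DoF end-effector delta + 1 gripper command (assumed layout)
        self.action_dim = action_dim

    def predict_action(self, obs: Observation) -> np.ndarray:
        # A real model tokenizes the image and instruction, runs a
        # transformer, and decodes action tokens. We return a zero
        # action as a stand-in so the sketch stays runnable.
        return np.zeros(self.action_dim)

# Control loop: query the policy each step and hand actions to the robot.
policy = VLAPolicy()
obs = Observation(
    rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    instruction="pick up the red mug and place it on the table",
    proprio=np.zeros(8),
)
action = policy.predict_action(obs)  # (7,) end-effector command
```

In deployment, a loop like this typically runs at control frequency, with the predicted end-effector commands passed to a low-level controller.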
RFMs differ from traditional robotics pipelines along several dimensions:
Data Scale and Diversity:
Traditional Robotics: Models are trained on small, task-specific datasets, limiting their adaptability.
RFMs: Leverage large-scale, internet-derived datasets, enabling stronger generalization and problem-solving beyond the training distribution.
Learning Paradigms:
Traditional Robotics: Focuses on narrowly defined models tailored to specific tasks, often requiring manual tuning.
RFMs: Employ foundation models pre-trained on extensive, multimodal data, demonstrating emergent abilities like zero-shot learning.
Adaptability and Generalization:
Traditional Robotics: Models typically perform well only within the boundaries of their training data.
RFMs: Exhibit remarkable adaptability, enabling their use across a variety of tasks with minimal retraining.
Integration of Multimodal Data:
Traditional Robotics: Relies on specific types of data processed in isolation.
RFMs: Combine multimodal inputs (vision, language, and other sensory streams) into unified representations, enhancing understanding and interaction; see the fusion sketch after this comparison.
Emergent Capabilities:
Traditional Robotics: Limited to predefined tasks, with challenges in addressing unforeseen scenarios.
RFMs: Capable of tackling novel problems using zero-shot learning and other emergent abilities.
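To illustrate what "unified representations" can look like in practice, here is a minimal late-fusion sketch: per-modality embeddings are projected into a shared space and concatenated. The module name, encoder dimensions, and fusion scheme are assumptions for exposition, not the architecture of any specific RFM.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Toy late-fusion module: project per-modality embeddings into a
    shared space and concatenate them. All dimensions are illustrative."""
    def __init__(self, vision_dim=768, text_dim=512, proprio_dim=16, shared_dim=256):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.proprio_proj = nn.Linear(proprio_dim, shared_dim)

    def forward(self, vision_emb, text_emb, proprio):
        # Map each modality into the same shared_dim space, then
        # concatenate into one unified representation for a policy head.
        return torch.cat([
            self.vision_proj(vision_emb),
            self.text_proj(text_emb),
            self.proprio_proj(proprio),
        ], dim=-1)

# Random stand-in embeddings; in practice these would come from
# pretrained vision and language encoders.
fusion = MultimodalFusion()
z = fusion(torch.randn(1, 768), torch.randn(1, 512), torch.randn(1, 16))
print(z.shape)  # torch.Size([1, 768]) = 3 * shared_dim
```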
A recent study [2] investigates how the quantity and diversity of training data affect a robot's ability to generalize in robotic manipulation tasks. The authors conducted an extensive empirical study, collecting over 40,000 demonstrations and performing more than 15,000 real-world robot rollouts. Their key findings include:
Power-Law Relationship of Scale and Performance: The generalization performance of robotic policies improves following a power-law trend as the number of training environments and objects increases. This indicates that doubling the diversity of training data leads to consistent, albeit diminishing, performance gains.
Importance of Diversity: The diversity of training environments and objects is crucial, and matters more than the sheer number of demonstrations collected in any single environment or with any single object, where returns diminish quickly. This places a premium on collecting data across many diverse settings, which is the logistically hard part.
Efficient Data Collection Strategy: A collection strategy that prioritizes diversity across settings produced policies achieving approximately 90% success rates in novel environments with unseen objects.
These findings suggest a scaling law for RFMs: capability grows predictably with the scale and diversity of training data. Building a stronger RFM therefore depends, above all, on collecting more, and more varied, data. The sketch below illustrates fitting such a power law.
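As a minimal illustration of such a power-law fit, the sketch below uses SciPy's `curve_fit` on made-up (environment count, success rate) pairs; neither the data nor the fitted exponent comes from [2].

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (not real) data: number of distinct training environments
# vs. policy success rate on held-out environments with unseen objects.
num_envs = np.array([2, 4, 8, 16, 32, 64], dtype=float)
success = np.array([0.30, 0.42, 0.53, 0.63, 0.72, 0.80])

def power_law(n, a, b):
    """Generalization ~ a * n^b: each doubling of diversity multiplies
    performance by a constant factor 2**b, so absolute gains diminish."""
    return a * n ** b

(a, b), _ = curve_fit(power_law, num_envs, success)
print(f"fit: success ~ {a:.3f} * n^{b:.3f}")
# Note: success rates saturate at 1.0, so a pure power law is only a
# local fit; extrapolate with care.
print("predicted success with 128 envs:", power_law(128.0, a, b))
```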
Data forms the backbone of RFMs, determining their capacity for generalization and adaptability. Several types of data are critical for building effective RFMs (a sketch of one combined training record follows the list):
Visual Data:
Essential for object detection, scene understanding, and spatial reasoning.
Includes 2D images, videos, and 3D data like point clouds and volumetric representations.
Natural Language Data:
Enables RFMs to interpret and execute high-level commands using natural language.
Includes annotated datasets of human instructions, narratives, and dialogues.
Human Demonstration Data:
Enhances models' ability to replicate human actions via imitation learning.
Combines visual and linguistic annotations to provide contextually rich demonstrations.
Synthetic and Simulated Data:
Bridges the gap in real-world data availability by augmenting datasets with realistic simulations.
Useful for tasks like navigation and planning in dynamic or hazardous environments.
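As a sketch of how these modalities might be packaged into a single training example, here is one possible record schema. The field names and shapes are assumptions for exposition, not any standard dataset format.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RFMTrainingRecord:
    """One illustrative multimodal training example for an RFM.

    Real datasets (teleoperation logs, simulation rollouts, etc.)
    define their own schemas; this layout is assumed for exposition.
    """
    rgb_frames: np.ndarray       # (T, H, W, 3) episode video
    depth: Optional[np.ndarray]  # (T, H, W) depth maps, or None
    instruction: str             # natural-language task description
    actions: np.ndarray          # (T, action_dim) demonstrated actions
    proprio: np.ndarray          # (T, proprio_dim) joint states
    source: str = "real"         # "real", "simulated", or "synthetic"

record = RFMTrainingRecord(
    rgb_frames=np.zeros((50, 224, 224, 3), dtype=np.uint8),
    depth=None,
    instruction="stack the green block on the blue block",
    actions=np.zeros((50, 7)),
    proprio=np.zeros((50, 14)),
)
```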
[1] Firoozi, Roya, et al. "Foundation Models in Robotics: Applications, Challenges, and the Future." arXiv preprint arXiv:2312.07843 (2023).
[2] Lin, Fanqi, et al. "Data Scaling Laws in Imitation Learning for Robotic Manipulation." arXiv preprint arXiv:2410.18647 (2024).
[3] Kim, Moo Jin, et al. "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv preprint arXiv:2406.09246 (2024).
[4] Zhang, Qiang, et al. "Whole-Body Humanoid Robot Locomotion with Human Reference." 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024.
[5] Brohan, Anthony, et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." Conference on Robot Learning. PMLR, 2023.