Reborn Data-to-Model Pipeline
To train generalizable Robotic Foundation Models (RFMs), Reborn has developed a comprehensive multimodal data collection framework that captures human behavior across both physical and virtual contexts. This framework ensures that robots learn not only from idealized simulations but also from the rich variability of real-world interactions.
Reborn’s data engine integrates four complementary streams, each contributing a unique modality for embodied AI:
Reborn Mocap Life (Rebocap™): Full-body human keypoint trajectories collected through wearable motion capture in natural settings or VR. Use case: Humanoid locomotion models, full-body control, and imitation learning.
Reborn VR Gaming: Immersive virtual reality tasks using devices such as Apple Vision Pro, capturing hand landmarks, gestures, and object interactions. Use case: Dexterous manipulation models, ergonomic HCI training, and collaborative tasks.
Roboverse Simulation: A large-scale simulation platform used to generate synthetic data across diverse task settings, sensor views, and robot embodiments. Use case: Pretraining large-scale control policies, sim-to-real adaptation, long-horizon planning.
Reborn Embodied Vlog: Real-world, first-person videos of fine-manipulation tasks (e.g., cooking, cleaning). Use case: Vision-language-action models, task segmentation, and motion planning.
These datasets are collected through intuitive and engaging user interfaces—such as games, home routines, and VR activities—ensuring natural behavior and scalable participation.
Reborn’s data is valuable because it provides a complete, high-resolution view of human behavior through its end-to-end collection approach and achieves broad scale and diversity through its Web3-enabled ecosystem. This combination ensures that embodied AI systems trained on Reborn’s data are robust, adaptable, and capable of human-like performance in real-world scenarios. Reborn aims to set a new standard for how data collection can drive progress in robotics and artificial intelligence.
End-to-End Data Collection: Reborn captures a complete record of human motion across daily life, from fine hand manipulations to full-body movements, in diverse real-world contexts. By integrating video, motion capture, and VR data, it provides a rich, multimodal dataset essential for training embodied AI systems to perform human-like tasks with precision and context awareness.
Scale and Diversity Enabled by Web3: Leveraging Web3 technology, Reborn enables global participation and incentivizes contributors with tokenized rewards. This decentralized approach supports the scale and diversity required by the data scaling laws that govern RFM performance, enabling robust, generalizable AI models capable of adapting to varied environments and tasks.
Each data type serves as input to different model families:
| Data Type | Model Focus |
| --- | --- |
| Embodied Vlog | Vision-language planning [1], object detection, scene grounding |
| Mocap Life | Full-body control, humanoid locomotion, pose imitation |
| VR Gaming | Hand manipulation [3, 4], fine motor control, human-robot interaction [2], dex grasping |
| Roboverse Sim | Pretraining RL/VLA models, multi-task generalization, edge case synthesis |
These data streams are aligned with Reborn’s progressive strategy of incubating vertical models before converging toward RFMs. By covering both low-level control and high-level semantic reasoning, Reborn’s dataset enables multi-layered learning.
Imitation learning [6] enables robots to acquire complex manipulation skills by learning from expert demonstrations. This process involves mapping observed states to corresponding actions, allowing robots to replicate desired behaviors. We use a manipulation task [7] as an example.
Problem Formulation
Consider a robot operating within a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where:
$\mathcal{S}$ is the state space.
$\mathcal{A}$ is the action space.
$P(s' \mid s, a)$ represents the transition probability from state $s$ to state $s'$ given action $a$.
$R(s, a)$ denotes the reward received after taking action $a$ in state $s$.
The goal is to learn a policy $\pi: \mathcal{S} \to \mathcal{A}$ that maps each state $s$ to an action $a$, effectively imitating the behavior demonstrated by an expert.
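To ground the notation, the sketch below shows one way the MDP ingredients and a policy's type could be represented in code. It is a minimal illustration with assumed names (`Transition`, `Policy`), not a description of Reborn's internal framework.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Transition:
    """One MDP transition (s, a, R(s, a), s'); all names are illustrative."""
    state: np.ndarray       # s  : element of the state space S
    action: np.ndarray      # a  : element of the action space A
    reward: float           # R(s, a)
    next_state: np.ndarray  # s' : sampled from P(s' | s, a)


# A deterministic policy pi: S -> A is simply a mapping from states to actions.
Policy = Callable[[np.ndarray], np.ndarray]
```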
Data Collection
An expert provides a dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$ of state-action pairs, where each pair consists of a state $s_i$ and the corresponding action $a_i$ taken by the expert.
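As an illustrative (not prescriptive) sketch, such a demonstration dataset could be loaded into state-action tensors as follows; the file format and the keys `states` and `actions` are assumptions made for the example.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset


def load_demonstrations(path: str) -> TensorDataset:
    """Load expert state-action pairs D = {(s_i, a_i)} from an .npz file.
    The keys 'states' and 'actions' are assumed for illustration only."""
    data = np.load(path)
    states = torch.as_tensor(data["states"], dtype=torch.float32)  # shape (N, state_dim)
    actions = torch.as_tensor(data["actions"], dtype=torch.long)   # shape (N,), discrete action ids
    return TensorDataset(states, actions)


# Hypothetical usage:
#   dataset = load_demonstrations("expert_demos.npz")
#   states, actions = dataset[0]  # one expert state-action pair
```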
Policy Representation
The policy $\pi_\theta$ is parameterized by $\theta$, often using a neural network. The network takes a state $s$ as input and outputs a probability distribution over possible actions, $\pi_\theta(a \mid s)$.
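A minimal sketch of such a parameterized policy, assuming a discrete action space and a small multilayer perceptron, is shown below; the architecture and layer sizes are purely illustrative.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """pi_theta(a | s): maps a state s to a categorical distribution over a
    discrete action space. The hidden sizes are placeholders."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # unnormalized logits
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns logits; softmax(logits) gives pi_theta(a | s).
        return self.net(state)
```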
Loss Function Design
To train the policy, we minimize the discrepancy between the actions predicted by the policy and the expert's actions. A common approach is to use the cross-entropy loss:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)$$

This loss function penalizes the model when the predicted probability $\pi_\theta(a_i \mid s_i)$ is low for the expert's action $a_i$.
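In code, this objective corresponds to the standard cross-entropy (negative log-likelihood) loss between the policy's logits and the expert's action labels; the sketch below assumes the discrete-action `PolicyNetwork` above.

```python
import torch
import torch.nn.functional as F


def behavior_cloning_loss(policy: torch.nn.Module,
                          states: torch.Tensor,
                          expert_actions: torch.Tensor) -> torch.Tensor:
    """L(theta) = -(1/N) * sum_i log pi_theta(a_i | s_i).
    F.cross_entropy applies log-softmax to the logits and averages the
    negative log-probability assigned to the expert's actions."""
    logits = policy(states)                         # shape (N, action_dim)
    return F.cross_entropy(logits, expert_actions)  # scalar loss
```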
Training with SGD and Mini-Batch Gradient Descent
The parameters $\theta$ are optimized using Stochastic Gradient Descent (SGD). The update rule for $\theta$ at iteration $t$ is:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate. In practice, the dataset is divided into mini-batches of size $B$. For each mini-batch $\mathcal{B}$, the gradient is computed as:

$$\nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta) = -\frac{1}{B} \sum_{(s_i, a_i) \in \mathcal{B}} \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$

The parameters are then updated using this mini-batch gradient:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta_t)$$
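Putting the pieces together, a minimal behavior-cloning training loop with mini-batch SGD might look like the sketch below, which builds on the `PolicyNetwork` and `behavior_cloning_loss` sketches above; the hyperparameters are placeholders rather than recommended values.

```python
import torch
from torch.utils.data import DataLoader


def train_bc(policy, dataset, lr: float = 1e-3, batch_size: int = 256, epochs: int = 10):
    """Mini-batch SGD on the imitation (cross-entropy) loss.
    Each step applies theta <- theta - lr * gradient of the mini-batch loss."""
    optimizer = torch.optim.SGD(policy.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for epoch in range(epochs):
        for states, expert_actions in loader:
            loss = behavior_cloning_loss(policy, states, expert_actions)
            optimizer.zero_grad()
            loss.backward()   # computes the mini-batch gradient
            optimizer.step()  # theta_{t+1} = theta_t - lr * gradient
    return policy
```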
By following this process, imitation learning enables robots to acquire manipulation skills by learning from expert demonstrations, effectively mapping observed states to appropriate actions.
Human full-body motion capture (mocap) data provides an invaluable foundation for training Task and Motion Planning (TAMP) models, enabling robots to learn complex task execution through a combination of precise motion data and contextual task information. This integration forms the basis for Vision-Language Action (VLA) models, where human motions are aligned with task narratives to teach robots how to plan and execute actions effectively in dynamic environments.
Capturing Precise Motion Dynamics: Full-body mocap data provides detailed 3D representations of human joint movements, capturing kinematic patterns during task execution. By incorporating this data, TAMP models gain an accurate understanding of human motion trajectories, joint coordination, and body mechanics, which serve as benchmarks for robotic motion planning.
Task-Context Alignment: Motion data is paired with annotated task descriptions, offering TAMP models the ability to associate physical actions with their contextual purposes. For example, motions like "bending down" can be linked to tasks such as "picking up an object." This alignment enhances the robot’s capability to interpret and adapt to task-specific scenarios.
Learning Action Sequences: By observing and analyzing sequential motions, TAMP models learn to break down complex tasks into smaller, manageable sub-actions. For instance, assembling furniture can be modeled as a sequence of movements: "aligning parts," "tightening screws," and "testing stability." These action sequences are crucial for generating coherent and feasible robotic task plans (a minimal data-layout sketch follows this list).
Imitation-Based Skill Acquisition: Robots can utilize mocap data of human demonstrations to learn motor skills through imitation. This process reduces reliance on traditional programming by allowing robots to replicate demonstrated behaviors, such as navigating obstacles or manipulating objects, directly from the data.
Dynamic Adaptation: Mocap data provides a diverse range of human motion examples across varying conditions, enabling TAMP models to account for environmental and task variability. This adaptability is essential for real-world applications, where robots must dynamically respond to changing scenarios.
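As a purely hypothetical illustration of how a mocap segment might be paired with its task annotation and sub-action sequence (the schema, field names, and labels here are assumptions, not Reborn's data format):

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class AnnotatedMotionSegment:
    """One mocap clip aligned with its task context.
    joints: array of shape (T, J, 3) holding J joint positions over T frames."""
    joints: np.ndarray
    task: str                                              # high-level task description
    sub_actions: List[str] = field(default_factory=list)   # ordered sub-action labels


# Hypothetical example: "picking up an object" decomposed into sub-actions.
segment = AnnotatedMotionSegment(
    joints=np.zeros((120, 24, 3)),  # 120 frames, 24 joints (placeholder values)
    task="pick up an object from the floor",
    sub_actions=["walk to object", "bend down", "grasp object", "stand up"],
)
```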
Understanding Human Motion: Motion capture data offers precise measurements of human joint movements during task execution. By analyzing this data, robots can develop a nuanced understanding of human kinematics, enabling them to mimic actions accurately.
Contextual Task Interpretation: Pairing motion data with task descriptions helps robots comprehend the intent and context behind each movement. This holistic perspective is crucial for executing tasks that require adaptability to dynamic environments.
Learning from Demonstrations: Utilizing human demonstrations as references allows robots to acquire complex motor skills through imitation, streamlining the learning process and reducing the need for extensive programming.
Incorporating human joint data into TAMP frameworks enables robots to plan and execute tasks that closely align with human strategies. Our approach facilitates seamless human-robot collaboration and enhances the robot's ability to perform tasks in a manner that is intuitive and efficient.
The study "Motion Planning through Demonstration to Deal with Complex Motions in Assembly Process" [5] demonstrates the effectiveness of using human motion data to inform robotic motion planning, particularly in complex assembly tasks. By capturing human movements, the research provides robots with the necessary information to replicate intricate motions, thereby improving task execution accuracy. By leveraging human joint data alongside task descriptions, robotic systems can achieve more natural and effective task planning and execution, ultimately leading to improved performance in real-world applications.
Our data also helps robots better understand human motions and translate them into language that LLMs or VLA models can interpret. For instance, the study "MotionGPT: Human Motion as a Foreign Language" [8] presents a unified motion-language generation model that treats human motion data as a language, allowing for the synthesis and understanding of diverse motions guided by textual descriptions. With such textual descriptions, the robot's "brain", an LLM, can understand human motions and produce corresponding actions.
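To illustrate the general idea of treating motion as tokens, the toy sketch below uniformly bins joint angles into discrete token strings that could be interleaved with text. MotionGPT itself learns a vector-quantized motion codebook, so this is only a conceptual stand-in, and all names here are illustrative.

```python
import numpy as np


def quantize_motion(joint_angles: np.ndarray, num_bins: int = 64) -> list:
    """Toy motion 'tokenizer': uniformly bins each joint-angle value (in radians)
    into one of num_bins discrete ids, yielding a token sequence that a language
    model could consume alongside text."""
    lo, hi = -np.pi, np.pi
    bins = np.clip(((joint_angles - lo) / (hi - lo) * num_bins).astype(int), 0, num_bins - 1)
    return [f"<motion_{b}>" for b in bins.flatten()]


# Hypothetical usage: motion tokens interleaved with text in an LLM prompt.
motion = np.random.uniform(-np.pi, np.pi, size=(8, 24))  # 8 frames, 24 joint angles
prompt = "Describe the motion: " + " ".join(quantize_motion(motion))
```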
References
"Humanoid Robot Motion Planning Approaches: a Survey," Journal of Intelligent & Robotic Systems, 2023.
"Progress and prospects of the human–robot collaboration." Autonomous robots 42 (2018): 957-975.
"Trends and challenges in robot manipulation." Science 364.6446 (2019)
"Survey of Learning Approaches for Robotic In-Hand Manipulation." arXiv preprint arXiv:2401.07915 (2024).
"Motion planning through demonstration to deal with complex motions in assembly process." 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids)
"A survey of imitation learning: Algorithms, recent developments, and challenges." IEEE Transactions on Cybernetics (2024).
"Decomposing the generalization gap in imitation learning for visual robotic manipulation." 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024.
"Motiongpt: Human motion as a foreign language." Advances in Neural Information Processing Systems 36 (2023): 20067-20079.