Reborn Data-to-Model Pipeline


From Human Behavior Data to Embodied Intelligence

To train generalizable Robotic Foundation Models (RFMs), Reborn has developed a comprehensive multimodal data collection framework that captures human behavior across both physical and virtual contexts. This framework ensures that robots learn not only from idealized simulations but also from the rich variability of real-world interactions.

What Data Reborn Collects and Provides

Reborn’s data engine integrates four complementary streams, each contributing a unique modality for embodied AI:

  • Reborn Mocap Life (Rebocap™): Full-body human keypoint trajectories collected through wearable motion capture in natural settings or VR. Use case: Humanoid locomotion models, full-body control, and imitation learning.

  • Reborn VR Gaming: Immersive virtual reality tasks using devices like Apple Vision Pro, capturing hand landmarks, gestures, and object interactions. Use case: Dexterous manipulation models, ergonomic HCI training, and collaborative tasks.

  • Roboverse Simulation: A large-scale simulation platform used to generate synthetic data across diverse task settings, sensor views, and robot embodiments. Use case: Pretraining large-scale control policies, sim-to-real adaptation, long-horizon planning.

  • Reborn Embodied Vlog: Real-world, first-person perspective videos of fine-manipulation tasks (e.g., cooking, cleaning). Use case: Vision-language-action models, task segmentation, and motion planning.

These datasets are collected through intuitive and engaging user interfaces—such as games, home routines, and VR activities—ensuring natural behavior and scalable participation.
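
For illustration, a record from any of these streams could be represented along the lines of the sketch below. The schema, field names, and stream labels are assumptions made for this example, not Reborn's actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import numpy as np


class Stream(Enum):
    MOCAP_LIFE = "rebocap"        # full-body keypoint trajectories
    VR_GAMING = "vr_gaming"       # hand landmarks and object interactions
    ROBOVERSE_SIM = "roboverse"   # synthetic rollouts from simulation
    EMBODIED_VLOG = "vlog"        # first-person manipulation video


@dataclass
class BehaviorSample:
    """One multimodal training record (hypothetical schema for illustration)."""
    stream: Stream
    timestamps: np.ndarray                     # (T,) capture times in seconds
    keypoints: Optional[np.ndarray] = None     # (T, J, 3) body or hand joints
    video_frames: Optional[np.ndarray] = None  # (T, H, W, 3) egocentric RGB, if available
    task_description: str = ""                 # natural-language task annotation
    robot_embodiment: Optional[str] = None     # set only for simulated rollouts
```

A downstream pipeline could then filter or batch samples by `stream` when routing them to different vertical models.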

Why Reborn Data is Precious

Reborn’s data is precious because it provides a complete, high-resolution view of human behavior through its end-to-end collection approach and achieves unmatched scale and diversity through its Web3-enabled ecosystem. This combination ensures that embodied AI systems trained on Reborn’s data are robust, adaptable, and capable of human-like performance in real-world scenarios. Reborn is setting a new standard for how data collection can drive progress in robotics and artificial intelligence.

  1. End-to-End Data Collection: Reborn captures a complete record of human motion across daily life, from fine hand manipulations to full-body movements, in diverse real-world contexts. By integrating video, motion capture, and VR data, it provides a rich, multimodal dataset essential for training embodied AI systems to perform human-like tasks with precision and context awareness.

  2. Scale and Diversity Enabled by Web3: Leveraging Web3 technology, Reborn ensures global participation and incentivizes contributors with tokenized rewards. This decentralized approach delivers the scale and diversity that data scaling laws demand for RFMs, enabling robust, generalizable AI models capable of adapting to varied environments and tasks.

From Reborn Data to Reborn Models

Each data type serves as input to different model families:

| Data Type | Model Focus |
| --- | --- |
| Embodied Vlog | Vision-language planning [1], object detection, scene grounding |
| Mocap Life | Full-body control, humanoid locomotion, pose imitation |
| VR Gaming | Hand manipulation [3, 4], fine motor control, human-robot interaction [2], dex grasping |
| Roboverse Sim | Pretraining RL/VLA models, multi-task generalization, edge case synthesis |

These data streams are aligned with Reborn’s progressive strategy of incubating vertical models before converging toward RFMs. By covering both low-level control and high-level semantic reasoning, Reborn’s dataset enables multi-layered learning.
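
As a rough, purely illustrative sketch of that routing, the mapping below restates the table as a lookup from data stream to the model families it primarily feeds; the stream keys and model-family names are placeholders, not identifiers from Reborn's stack.

```python
# Hypothetical routing of data streams to vertical model families,
# mirroring the table above (illustrative names only).
MODEL_FOCUS: dict[str, list[str]] = {
    "embodied_vlog": ["vision_language_planning", "object_detection", "scene_grounding"],
    "mocap_life":    ["full_body_control", "humanoid_locomotion", "pose_imitation"],
    "vr_gaming":     ["hand_manipulation", "fine_motor_control",
                      "human_robot_interaction", "dexterous_grasping"],
    "roboverse_sim": ["rl_vla_pretraining", "multi_task_generalization",
                      "edge_case_synthesis"],
}


def models_for(stream: str) -> list[str]:
    """Return the vertical model families a data stream contributes to."""
    return MODEL_FOCUS[stream]
```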

Training RFM with Reborn Data: An Imitation Learning Example

Learning with Manipulation Data from VR and Video

Imitation learning [6] enables robots to acquire complex manipulation skills by learning from expert demonstrations. This process involves mapping observed states to corresponding actions, allowing robots to replicate desired behaviors. We use a manipulation task [7] as a running example.

Problem Formulation

Consider a robot operating within a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where:

  • $\mathcal{S}$ is the state space.

  • $\mathcal{A}$ is the action space.

  • $P(s' \mid s, a)$ represents the transition probability from state $s$ to state $s'$ given action $a$.

  • $R(s, a)$ denotes the reward received after taking action $a$ in state $s$.

The goal is to learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ that maps each state $s \in \mathcal{S}$ to an action $a \in \mathcal{A}$, effectively imitating the behavior demonstrated by an expert.

Data Collection

An expert provides a dataset $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$ of $N$ state-action pairs, where each pair consists of a state $s_i$ and the corresponding action $a_i$ taken by the expert.

Policy Representation

The policy $\pi_\theta$ is parameterized by $\theta$, often using a neural network. The network takes a state $s$ as input and outputs a probability distribution over possible actions, $\pi_\theta(a \mid s)$.

Loss Function Design

To train the policy, we minimize the discrepancy between the actions predicted by the policy and the expert's actions. A common approach is to use the cross-entropy loss:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)$$

This loss function penalizes the model when the predicted probability $\pi_\theta(a_i \mid s_i)$ of the expert's action $a_i$ is low.

Training with SGD and Mini-Batch Gradient Descent

The parameters $\theta$ are optimized using Stochastic Gradient Descent (SGD). The update rule for $\theta$ at iteration $t$ is:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

In practice, the dataset $\mathcal{D}$ is divided into mini-batches of size $M$. For each mini-batch $\mathcal{B} \subset \mathcal{D}$, the gradient is computed as:

$$\nabla_\theta \mathcal{L}_\mathcal{B}(\theta) = -\frac{1}{M} \sum_{(s_i, a_i) \in \mathcal{B}} \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$

The parameters are then updated using this mini-batch gradient:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_\mathcal{B}(\theta_t)$$

By following this process, imitation learning enables robots to acquire manipulation skills by learning from expert demonstrations, effectively mapping observed states to appropriate actions.
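
As a concrete illustration of the steps above, here is a minimal behavior-cloning loop in PyTorch, assuming a discrete action space, pre-extracted 64-dimensional state features, and a small MLP policy; the dimensions, network, and randomly generated "expert" data are placeholders standing in for demonstrations collected from VR or video.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

STATE_DIM, NUM_ACTIONS, BATCH_SIZE, LR = 64, 10, 256, 1e-3  # assumed sizes

# Placeholder expert dataset D = {(s_i, a_i)}; real demonstrations would come
# from VR teleoperation or annotated egocentric video.
states = torch.randn(10_000, STATE_DIM)
actions = torch.randint(0, NUM_ACTIONS, (10_000,))
loader = DataLoader(TensorDataset(states, actions), batch_size=BATCH_SIZE, shuffle=True)

# Policy pi_theta(a | s): a small MLP producing logits over discrete actions.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=LR)
nll = nn.CrossEntropyLoss()  # equals -mean log pi_theta(a_i | s_i) over a mini-batch

for epoch in range(10):
    for s_batch, a_batch in loader:
        loss = nll(policy(s_batch), a_batch)  # L_B(theta)
        optimizer.zero_grad()
        loss.backward()                       # nabla_theta L_B(theta)
        optimizer.step()                      # theta <- theta - eta * gradient
```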

Training Robots for Locomotion and Planning from Human Mocap Data

Human full-body motion capture (mocap) data provides an invaluable foundation for training Task and Motion Planning (TAMP) models, enabling robots to learn complex task execution through a combination of precise motion data and contextual task information. This integration forms the basis for Vision-Language Action (VLA) models, where human motions are aligned with task narratives to teach robots how to plan and execute actions effectively in dynamic environments.

How Mocap Data Enables TAMP Training

  1. Capturing Precise Motion Dynamics: Full-body mocap data provides detailed 3D representations of human joint movements, capturing kinematic patterns during task execution. By incorporating this data, TAMP models gain an accurate understanding of human motion trajectories, joint coordination, and body mechanics, which serve as benchmarks for robotic motion planning.

  2. Task-Context Alignment: Motion data is paired with annotated task descriptions, offering TAMP models the ability to associate physical actions with their contextual purposes. For example, motions like "bending down" can be linked to tasks such as "picking up an object." This alignment enhances the robot’s capability to interpret and adapt to task-specific scenarios.

  3. Learning Action Sequences: By observing and analyzing sequential motions, TAMP models learn to break down complex tasks into smaller, manageable sub-actions. For instance, assembling furniture can be modeled as a sequence of movements: "aligning parts," "tightening screws," and "testing stability." These action sequences are crucial for generating coherent and feasible robotic task plans (a minimal sketch follows this list).

  4. Imitation-Based Skill Acquisition: Robots can utilize mocap data of human demonstrations to learn motor skills through imitation. This process reduces reliance on traditional programming by allowing robots to replicate demonstrated behaviors, such as navigating obstacles or manipulating objects, directly from the data.

  5. Dynamic Adaptation: Mocap data provides a diverse range of human motion examples across varying conditions, enabling TAMP models to account for environmental and task variability. This adaptability is essential for real-world applications, where robots must dynamically respond to changing scenarios.
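
A minimal sketch of points 2 and 3: slicing a full-body trajectory into sub-action clips paired with natural-language labels that a TAMP or VLA model could train on. The clip format, segment boundaries, and labels are illustrative assumptions, not Reborn's annotation schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MotionClip:
    """A mocap segment aligned with a natural-language sub-task (hypothetical format)."""
    joints: np.ndarray   # (T, J, 3) joint positions over T frames
    label: str           # e.g. "bend down", "pick up object"


def segment_trajectory(joints: np.ndarray,
                       annotations: list[tuple[int, int, str]]) -> list[MotionClip]:
    """Slice a trajectory into labeled sub-action clips.

    `annotations` holds (start_frame, end_frame, label) triples, e.g. produced
    by human labelers or an automatic segmenter.
    """
    return [MotionClip(joints[start:end], label) for start, end, label in annotations]


# Toy example: 300 frames of 24 joints decomposed into a two-step task plan.
trajectory = np.zeros((300, 24, 3))
plan = segment_trajectory(trajectory, [(0, 120, "bend down"), (120, 300, "pick up object")])
action_sequence = [clip.label for clip in plan]  # ["bend down", "pick up object"]
```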

Key Benefits:

  1. Understanding Human Motion: Motion capture data offers precise measurements of human joint movements during task execution. By analyzing this data, robots can develop a nuanced understanding of human kinematics, enabling them to mimic actions accurately.

  2. Contextual Task Interpretation: Pairing motion data with task descriptions helps robots comprehend the intent and context behind each movement. This holistic perspective is crucial for executing tasks that require adaptability to dynamic environments.

  3. Learning from Demonstrations: Utilizing human demonstrations as references allows robots to acquire complex motor skills through imitation, streamlining the learning process and reducing the need for extensive programming.

Application in Task and Motion Planning (TAMP):

Incorporating human joint data into TAMP frameworks enables robots to plan and execute tasks that closely align with human strategies. Our approach facilitates seamless human-robot collaboration and enhances the robot's ability to perform tasks in a manner that is intuitive and efficient.

The study "Motion Planning through Demonstration to Deal with Complex Motions in Assembly Process" [5] demonstrates the effectiveness of using human motion data to inform robotic motion planning, particularly in complex assembly tasks. By capturing human movements, the research provides robots with the necessary information to replicate intricate motions, thereby improving task execution accuracy. By leveraging human joint data alongside task descriptions, robotic systems can achieve more natural and effective task planning and execution, ultimately leading to improved performance in real-world applications.

Our data also helps robots better understand human motions and translate them into language that an LLM or VLA model can interpret. For instance, the study "MotionGPT: Human Motion as a Foreign Language" [8] presents a unified motion-language generation model that treats human motion data as a language, allowing diverse motions to be synthesized and understood from textual descriptions. With these textual descriptions, the LLM that serves as the robot's brain can understand human motions and produce the corresponding actions.
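
In the same spirit, the sketch below turns continuous joint trajectories into a sequence of discrete motion tokens that a language model could consume alongside text. The uniform random codebook and nearest-neighbor lookup are deliberate simplifications of the learned VQ-VAE tokenizer used in MotionGPT.

```python
import numpy as np

VOCAB_SIZE = 512  # size of the hypothetical motion codebook


def tokenize_motion(poses: np.ndarray, codebook: np.ndarray) -> list[int]:
    """Map each mocap frame to the index of its nearest codebook entry.

    poses:    (T, D) flattened per-frame pose vectors
    codebook: (VOCAB_SIZE, D) motion "vocabulary" (learned in practice,
              random here purely for illustration)
    """
    # Pairwise distances between frames and codebook entries: (T, VOCAB_SIZE)
    dists = np.linalg.norm(poses[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1).tolist()


# Toy example: 120 frames of a 72-dimensional pose become discrete motion ids
# that could be interleaved with text tokens for an LLM or VLA model.
rng = np.random.default_rng(0)
pose_sequence = rng.normal(size=(120, 72))
codebook = rng.normal(size=(VOCAB_SIZE, 72))
motion_tokens = tokenize_motion(pose_sequence, codebook)  # e.g. [417, 33, 250, ...]
```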


References

  1. "Humanoid Robot Motion Planning Approaches: a Survey," Journal of Intelligent & Robotic Systems, 2023.

  2. "Progress and prospects of the human–robot collaboration." Autonomous robots 42 (2018): 957-975.

  3. "Trends and challenges in robot manipulation." Science 364.6446 (2019)

  4. "Survey of Learning Approaches for Robotic In-Hand Manipulation." arXiv preprint arXiv:2401.07915 (2024).

  5. "Motion planning through demonstration to deal with complex motions in assembly process." 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids)

  6. "A survey of imitation learning: Algorithms, recent developments, and challenges." IEEE Transactions on Cybernetics (2024).

  7. "Decomposing the generalization gap in imitation learning for visual robotic manipulation." 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024.

  8. "Motiongpt: Human motion as a foreign language." Advances in Neural Information Processing Systems 36 (2023): 20067-20079.
