Reborn Embodied Vlog

Robots learn embodied manipulation from human handwork.

"Reborn Embodied Vlog" (REV) is designed as a mobile app that lets users record and upload first-person videos of fine manipulation tasks from their daily lives. The process has four steps:

  • Recording: Users film themselves with a smartphone, GoPro, or other camera while performing detailed tasks such as preparing food, cleaning, assembling objects, or other hand-manipulation actions.

  • Uploading: The app allows users to upload these videos directly to the platform, where the footage is anonymized and processed.

  • Video Analysis: The system processes the video data to extract hand movement, gestures, object interactions, and fine manipulation patterns.

  • Global Contribution: By contributing their data, users help build a large, diverse dataset that AI systems worldwide can draw on to make robot manipulation more accurate and efficient.

Converting Reborn Embodied Vlog Footage into Embodied Training Data

A structured pipeline can turn first-person video into hand landmarks for training dexterous robotic hands:

1. Video Preprocessing

  • Frame Extraction: Convert the video into a sequence of frames so that each can be analyzed individually. Choose a sampling rate (e.g., 30 FPS) that balances detail against computational cost.

  • Image Enhancement: Adjust brightness and contrast so that hands and objects remain clearly visible under varied lighting conditions.

  • Segmentation: Use a hand segmentation algorithm to isolate the hand region from the background, reducing noise and focusing the analysis.
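The frame extraction and image enhancement steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the REV implementation: the function names `sample_frame_indices` and `stretch_contrast` are hypothetical, frames are assumed to be addressable by index, and a grayscale image is assumed to be a flat list of intensities. A real pipeline would more likely decode video and adjust images with a library such as OpenCV.

```python
def sample_frame_indices(total_frames, source_fps, target_fps):
    """Pick which source frames to keep when resampling to target_fps.

    Hypothetical helper: assumes frames are numbered 0..total_frames-1
    and evenly spaced in time.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        # Nothing to drop: keep every frame.
        return list(range(total_frames))
    indices = []
    k = 0
    while True:
        # Source frame closest to (and not after) the k-th target timestamp.
        idx = int(k * source_fps / target_fps)
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


def stretch_contrast(pixels, out_min=0, out_max=255):
    """Linearly rescale grayscale intensities to span [out_min, out_max].

    A simple min-max contrast stretch; assumes `pixels` is a non-empty
    flat sequence of numbers.
    """
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Flat image: there is no contrast to stretch.
        return list(pixels)
    span = out_max - out_min
    return [round(out_min + (p - lo) * span / (hi - lo)) for p in pixels]


if __name__ == "__main__":
    # Downsample a 60 FPS clip of 10 frames to 30 FPS.
    print(sample_frame_indices(10, 60, 30))
    # Stretch a dim, low-contrast row of pixels to the full 8-bit range.
    print(stretch_contrast([10, 20, 30, 40]))
```

Dropping frames by index keeps the original timestamps recoverable, which matters later when landmark trajectories are aligned with robot control rates.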

2. Hand Landmark Detection

  • Pose Estimation Models: Use a hand pose estimation model, such as MediaPipe Hands or DeepHand, to detect keypoints on the hand.

    • Detect key landmarks such as fingertips, knuckles, wrist, and palm center.

    • Extract 2D or 3D coordinates for each keypoint, expressed relative to the image frame or the environment.

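  • 3D Reconstruction (if needed): Use stereo cameras to measure depth directly, or recover 3D hand shape from monocular video with dense-correspondence methods like DensePose or by fitting a parametric hand model such as MANO (hand Model with Articulated and Non-rigid defOrmations).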
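Once a model has produced keypoints, they usually need to be converted into a form a robot policy can consume. The sketch below is illustrative only and does not call any pose estimation library. It assumes landmarks arrive as (x, y) pairs normalized to [0, 1] by image width and height, in the MediaPipe Hands ordering where index 0 is the wrist and index 9 is the base of the middle finger. The helper names are hypothetical, and the depth function applies the standard pinhole stereo relation (depth = focal length × baseline / disparity) for rectified cameras.

```python
import math

WRIST = 0        # wrist index in the MediaPipe Hands landmark ordering
MIDDLE_MCP = 9   # base knuckle of the middle finger in the same ordering


def to_pixel_coords(landmarks, width, height):
    """Convert normalized (x, y) landmarks to pixel coordinates."""
    return [(x * width, y * height) for x, y in landmarks]


def wrist_relative(points, ref_index=MIDDLE_MCP):
    """Express points relative to the wrist, scaled by palm length.

    Dividing by the wrist-to-reference distance makes the result roughly
    independent of how far the hand is from the camera.
    """
    wx, wy = points[WRIST]
    rx, ry = points[ref_index]
    scale = math.hypot(rx - wx, ry - wy)
    if scale == 0:
        raise ValueError("wrist and reference landmark coincide")
    return [((x - wx) / scale, (y - wy) / scale) for x, y in points]


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Ten made-up landmarks: wrist at the image center, middle knuckle above it.
    normalized = [(0.5, 0.5)] + [(0.4, 0.4)] * 8 + [(0.5, 0.25)]
    pixels = to_pixel_coords(normalized, width=640, height=480)
    print(wrist_relative(pixels)[MIDDLE_MCP])
    print(depth_from_disparity(focal_px=700.0, baseline_m=0.1, disparity_px=35.0))
```

Normalizing by palm length is one common design choice among several; a pipeline targeting a specific robot hand might instead retarget joint angles, which requires full 3D keypoints rather than the 2D points used here.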