Reborn Embodied Vlog
Robots learn embodied manipulation from human handwork.
"Reborn Embodied Vlog" (REV) is designed as a mobile app that enables users to record and upload first-person perspective videos of fine manipulation tasks from their daily lives. The process is simple:
Recording: Users film themselves with a smartphone, GoPro, or other camera while performing detailed tasks, such as preparing food, cleaning, assembling objects, or other hand-manipulation actions.
Uploading: The app allows users to upload these videos directly to the platform, where the footage is anonymized and processed.
Video Analysis: The system processes the video data to extract hand movement, gestures, object interactions, and fine manipulation patterns.
Global Contribution: By contributing their data, users help build a large, diverse dataset that is accessible to AI systems worldwide and that is intended to improve the accuracy and efficiency of robot manipulation.
First-person video can be converted into hand landmarks for training dexterous robotic hands with the following pipeline:
Frame Extraction: Convert video into a sequence of frames to analyze each individually. Select a suitable frame rate (e.g., 30 FPS) to balance detail and computational cost.
Image Enhancement: Adjust image quality (e.g., brightness, contrast) so that hands and objects remain clearly visible under varying lighting conditions.
Segmentation: Use a hand segmentation algorithm to isolate the hand region from the background, reducing noise and focusing the analysis.
Pose Estimation Models: Use a hand pose estimation model, such as MediaPipe Hands or DeepHand, to detect keypoints on the hand.
Detect key landmarks such as fingertips, knuckles, wrist, and palm center.
Extract 2D or 3D coordinates for each keypoint, expressed relative to the image frame or the environment.
3D Reconstruction (if needed): Use stereo cameras, or infer depth from monocular video, for example by fitting a parametric hand model such as MANO (hand Model with Articulated and Non-rigid defOrmations) or by using dense-correspondence models like DensePose.
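Two of the steps above, frame-rate selection and mapping keypoints relative to the frame, can be illustrated with a minimal sketch. It is not REV's actual implementation. The function names (`sample_frame_indices`, `to_pixels`, `wrist_relative`) are hypothetical, and the code assumes a pose model that returns 21 landmarks per hand as (x, y) pairs normalized to [0, 1] by image width and height, with index 0 as the wrist and index 9 as the middle-finger knuckle, which is the layout MediaPipe Hands uses.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Assumed 21-point hand layout: index 0 = wrist, index 9 = middle-finger knuckle (MCP).
WRIST = 0
MIDDLE_MCP = 9


def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Return the indices of the source frames to keep when downsampling to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never upsample: if the target rate exceeds the source rate, keep every frame.
    step = max(source_fps / target_fps, 1.0)
    indices: List[int] = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


def to_pixels(landmarks: Sequence[Point], width: int, height: int) -> List[Point]:
    """Convert landmarks normalized to [0, 1] into pixel coordinates."""
    return [(x * width, y * height) for x, y in landmarks]


def wrist_relative(
    points: Sequence[Point],
    wrist_index: int = WRIST,
    reference_index: int = MIDDLE_MCP,
) -> List[Point]:
    """Express points relative to the wrist, scaled by the wrist-to-reference distance.

    The result does not depend on where the hand sits in the frame or on how
    large it appears, which makes samples from different videos comparable.
    """
    wx, wy = points[wrist_index]
    rx, ry = points[reference_index]
    scale = math.hypot(rx - wx, ry - wy)
    if scale == 0:
        raise ValueError("wrist and reference landmark coincide")
    return [((x - wx) / scale, (y - wy) / scale) for x, y in points]
```

A typical call order would be `sample_frame_indices` to choose which frames to decode, the pose model on each kept frame, then `to_pixels` followed by `wrist_relative`. Converting to pixels first matters on non-square frames, because normalized coordinates stretch distances differently along each axis.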