Reborn Embodied Vlog
Robots learn embodied manipulation from human handwork.
"Reborn Embodied Vlog" (REV) is designed as a mobile app that enables users to record and upload first-person perspective videos of fine manipulation tasks from their daily lives. The process is simple:
Recording: Users capture video with a smartphone, GoPro, or other camera while performing detailed tasks, such as preparing food, cleaning, assembling objects, or other hand-manipulation actions.
Uploading: The app allows users to upload these videos directly to the platform, where the footage is anonymized and processed.
Video Analysis: The system processes the video data to extract hand movements, gestures, object interactions, and fine manipulation patterns.
Global Contribution: By contributing their data, users help build a massive, diverse dataset that is accessible to AI systems worldwide, improving the accuracy and efficiency of robot manipulation tasks.
To process first-person perspective video data into hand landmarks for robotic dexterous hand training, a structured pipeline can be followed:
Frame Extraction: Convert video into a sequence of frames to analyze each individually. Select a suitable frame rate (e.g., 30 FPS) to balance detail and computational cost.
Image Enhancement: Improve video quality (e.g., brightness, contrast) to ensure clear visualization of hands and objects in various lighting conditions.
Segmentation: Use a hand segmentation algorithm to isolate the hand region from the background, reducing noise and focusing the analysis.
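A minimal Python sketch of these three preprocessing steps, assuming OpenCV; the skin-colour threshold is only a stand-in for a trained hand-segmentation model, and the frame rate and CLAHE parameters are illustrative.

```python
import cv2
import numpy as np

def extract_frames(video_path, target_fps=30):
    """Yield frames from a video, downsampled to roughly target_fps."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(src_fps / target_fps))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame
        idx += 1
    cap.release()

def enhance(frame):
    """Boost contrast with CLAHE on the luminance channel to handle poor lighting."""
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

def rough_hand_mask(frame):
    """Placeholder skin-colour mask; a learned segmentation model would replace this."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    return cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))
```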
Pose Estimation Models: Utilize state-of-the-art hand pose estimation models, such as MediaPipe Hands or DeepHand, to detect keypoints on the hand.
Detect key landmarks such as fingertips, knuckles, wrist, and palm center.
Use 2D/3D coordinate extraction to map hand keypoints relative to the frame or environment.
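A minimal sketch of landmark extraction with MediaPipe Hands (the `extract_frames` helper comes from the preprocessing sketch above). MediaPipe reports 21 landmarks per hand, with x and y normalised to the image and z given relative to the wrist.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_hand_landmarks(frame_bgr, hands):
    """Return a list of 21 (x, y, z) tuples per detected hand."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return []
    return [[(lm.x, lm.y, lm.z) for lm in hand.landmark]
            for hand in result.multi_hand_landmarks]

# Usage: reuse one detector instance across all frames of an egocentric clip.
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    for frame in extract_frames("clip.mp4"):   # from the preprocessing sketch
        landmarks = extract_hand_landmarks(frame, hands)
```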
3D Reconstruction (if needed): Use stereo cameras, or infer 3D hand pose from monocular video with model-based approaches such as MANO, a parametric articulated hand model.
Landmark Smoothing: Apply temporal smoothing algorithms (e.g., Kalman filters or Savitzky–Golay filters) to reduce noise and ensure continuous motion tracking between frames.
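A small sketch of the smoothing step, assuming SciPy; the window length and polynomial order are illustrative and would be tuned to the frame rate.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_landmark_track(track, window=9, poly=3):
    """Temporally smooth a landmark trajectory.

    `track` has shape (num_frames, num_landmarks, 3); the Savitzky-Golay
    filter is applied independently along the time axis."""
    track = np.asarray(track, dtype=float)
    if track.shape[0] < window:   # clip too short to smooth; return unchanged
        return track
    return savgol_filter(track, window_length=window, polyorder=poly, axis=0)
```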
Gesture Recognition: Identify and label hand gestures or manipulation patterns using gesture classification models.
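As a placeholder for a learned gesture classifier, a simple geometric rule over the smoothed landmarks can already label basic manipulation gestures; the thresholds below are illustrative.

```python
import numpy as np

def label_gesture(landmarks, pinch_thresh=0.06, fist_thresh=0.35):
    """Heuristic gesture labelling from 21 normalised (x, y, z) hand landmarks.

    A trained classifier on sequences of landmark features would replace
    these hand-tuned rules in the full pipeline."""
    pts = np.asarray(landmarks)[:, :2]
    thumb_tip, index_tip, wrist = pts[4], pts[8], pts[0]
    # Pinch: thumb and index fingertips nearly touching.
    if np.linalg.norm(thumb_tip - index_tip) < pinch_thresh:
        return "pinch_grasp"
    # Fist: all fingertips curled in close to the wrist.
    tips = pts[[8, 12, 16, 20]]
    if np.mean(np.linalg.norm(tips - wrist, axis=1)) < fist_thresh:
        return "power_grasp"
    return "open_hand"
```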
Object Detection: Use object detection algorithms (e.g., YOLO, Faster R-CNN) to identify objects the hand interacts with.
Hand-Object Relationship: Map hand landmarks to object interaction points, such as the grasp points on an object.
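Given bounding boxes from any detector (YOLO, Faster R-CNN, etc.) and the hand landmarks in pixel coordinates, a simple proximity test can produce candidate grasp points; the pixel margin below is an illustrative value.

```python
import numpy as np

# MediaPipe landmark indices of the five fingertips (thumb..pinky).
FINGERTIPS = [4, 8, 12, 16, 20]

def hand_object_contacts(landmarks_px, boxes, margin=10):
    """Associate fingertip landmarks with detected object boxes.

    landmarks_px: (21, 2) array of hand landmarks in pixel coordinates.
    boxes: list of (x1, y1, x2, y2, label) from any object detector.
    Returns, per object, the fingertips lying within `margin` pixels of its
    box, which serve as candidate grasp points for annotation."""
    contacts = []
    tips = np.asarray(landmarks_px)[FINGERTIPS]
    for (x1, y1, x2, y2, label) in boxes:
        inside = [
            tip_idx for tip_idx, (x, y) in zip(FINGERTIPS, tips)
            if x1 - margin <= x <= x2 + margin and y1 - margin <= y <= y2 + margin
        ]
        if inside:
            contacts.append({"object": label, "fingertips": inside})
    return contacts
```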
Annotation Tools: Label the frames with structured metadata, including hand pose, gesture type, object type, and interaction style.
Dataset Structuring: Organize the processed data into a format compatible with robotic training frameworks, such as:
CSV or JSON files for keypoint data.
Video clips paired with landmark annotations.
Object interaction labels.
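One way to serialise the per-frame annotations into JSON; the field names below are illustrative, and CSV works equally well for the keypoint arrays.

```python
import json

def frame_record(frame_idx, timestamp_s, landmarks, gesture, contacts):
    """Build one structured annotation record for a processed frame."""
    return {
        "frame": frame_idx,
        "timestamp_s": timestamp_s,
        "hand_landmarks": landmarks,       # list of 21 [x, y, z] keypoints per hand
        "gesture": gesture,                # e.g. "pinch_grasp"
        "object_interactions": contacts,   # output of hand_object_contacts()
    }

def write_clip_annotations(records, path):
    """Serialise all frame records of a clip into a single JSON file."""
    with open(path, "w") as f:
        json.dump({"frames": records}, f, indent=2)
```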
Convert Landmarks into Robotic Coordinates: Map the extracted hand landmarks into the robot’s coordinate system, adapting for scale, orientation, and joint constraints.
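A minimal sketch of mapping camera-frame landmarks into the robot base frame. The rotation, scale, and offset are hard-coded placeholders; a real setup would obtain them from camera-to-robot calibration.

```python
import numpy as np

# Illustrative extrinsics from the egocentric camera frame to the robot base frame.
R_CAM_TO_BASE = np.eye(3)
SCALE = 1.0
T_CAM_TO_BASE = np.array([0.4, 0.0, 0.2])

def to_robot_frame(point_cam):
    """Map one 3D hand keypoint from camera coordinates to robot base coordinates."""
    return SCALE * (R_CAM_TO_BASE @ np.asarray(point_cam)) + T_CAM_TO_BASE

def clamp_to_joint_limits(q, lower, upper):
    """Respect the robot's joint limits when replaying retargeted poses."""
    return np.clip(q, lower, upper)
```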
Simulated Tasks: Use tools like PyBullet, MuJoCo, or Isaac Gym to simulate robotic hands performing the tasks captured in the video, using the retargeted landmarks as end-effector or joint targets.
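Building on the coordinate-mapping sketch above, a rough PyBullet example that drives a simulated hand along the human wrist trajectory via inverse kinematics. The URDF path and end-effector link index are placeholders for the actual dexterous-hand model.

```python
import pybullet as p

p.connect(p.DIRECT)                                          # headless physics server
robot = p.loadURDF("robot_hand.urdf", useFixedBase=True)     # placeholder URDF
EE_LINK = 6                                                  # placeholder end-effector link

# Indices of the actuated (non-fixed) joints in the URDF.
movable = [j for j in range(p.getNumJoints(robot))
           if p.getJointInfo(robot, j)[2] != p.JOINT_FIXED]

def follow_wrist_trajectory(wrist_points_cam, steps_per_point=10):
    """Track the retargeted wrist positions with position-controlled joints."""
    for pt in wrist_points_cam:
        target = to_robot_frame(pt)        # from the coordinate-mapping sketch above
        q = p.calculateInverseKinematics(robot, EE_LINK, target.tolist())
        for j, qj in zip(movable, q):
            p.setJointMotorControl2(robot, j, p.POSITION_CONTROL, targetPosition=qj)
        for _ in range(steps_per_point):
            p.stepSimulation()
```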
The videos uploaded by users are vital for training robots to perform fine manipulation tasks in the real world:
Human-Like Actions: The data provides robots with real-world examples of intricate human tasks, showing how hands and objects interact in various environments.
Contextual Learning: By analyzing a wide variety of tasks from diverse environments, robots can better understand contextual cues, such as object shapes, sizes, and textures, and how those objects must be manipulated to complete tasks like making a sandwich or washing a dish.
Skill Transfer: The robot learns from human-like actions in both familiar and unfamiliar scenarios, allowing it to generalize these skills across different tasks and settings.
Fine Manipulation Mastery: By training with first-person perspective videos, robots can refine their ability to perform delicate tasks, which are typically difficult for machines to master due to the nuanced control needed.