Abstract
Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advances in foundation models, particularly Vision Language Models (VLMs), have demonstrated remarkable vision and language reasoning capabilities for VIL tasks. Despite this progress, current VIL methods naively employ these models to learn only high-level plans from human videos and rely on pre-defined motion primitives to execute physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills down to the fine-grained action level from only a limited number of human videos. Specifically, human-object movements are first grounded from the human videos; the skill learner then delineates motion properties through keypoints and waypoints and acquires fine-grained action skills using hierarchical constraint representations. In unseen scenarios, the learned skills are updated through keypoint transfer and iterative comparison in the skill adapter, enabling efficient skill adaptation. To support high-precision manipulation, the skill refiner optimizes the extracted and transferred interactions for improved precision, and pose estimates are further refined through iterative master-slave contact, enabling the acquisition and accomplishment of even highly constrained manipulation tasks. This concise approach allows FMimic to effectively learn fine-grained actions from human videos without relying on predefined primitives. Extensive experiments show that FMimic achieves strong performance with a single human video and significantly outperforms all other methods given five videos. Our method achieves improvements of over 22% and 29% on RLBench and real-world manipulation tasks, respectively, and surpasses baselines by more than 34% on high-precision tasks and 47% on long-horizon tasks.