FMimic: Foundation Models are Fine-grained Action Learners from Human Videos

1 Beijing Institute of Technology   2 The University of Hong Kong

Abstract

Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advances in foundation models, particularly Vision Language Models (VLMs), have demonstrated remarkable vision and language reasoning capabilities for VIL tasks. Despite this progress, current VIL methods employ these models only to learn high-level plans from human videos and rely on pre-defined motion primitives to execute physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills down to the fine-grained action level, given only a limited number of human videos. Specifically, human-object movements are first grounded from the human videos; the skill learner then delineates motion properties through keypoints and waypoints and acquires fine-grained action skills using hierarchical constraint representations. In unseen scenarios, the skill adapter updates the learned skills through keypoint transfer and iterative comparison, enabling efficient skill adaptation. For high-precision manipulation, the skill refiner optimizes the extracted and transferred interactions and refines pose estimates through iterative master-slave contact, facilitating the acquisition and accomplishment of even highly constrained manipulation tasks. Our concise approach enables FMimic to effectively learn fine-grained actions from human videos, obviating the reliance on predefined primitives. Extensive experiments show that FMimic achieves strong performance with a single human video and significantly outperforms all other methods with five videos. Our method yields improvements of over 22% on RLBench tasks and 29% on real-world manipulation tasks, and surpasses baselines by more than 34% on high-precision tasks and 47% on long-horizon tasks.

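To make the pipeline summarized above concrete, the sketch below outlines its learn-adapt-refine-execute structure. This is a minimal illustration only: all class, function, and module names (e.g., `learn_skill`, `SkillAdapter` stages) are hypothetical and do not correspond to the authors' released code or APIs.

```python
# Hypothetical sketch of the FMimic pipeline described in the abstract.
# Stage names mirror the prose (skill learner, skill adapter, skill refiner);
# all identifiers here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Skill:
    keypoints: List[Tuple[float, float, float]] = field(default_factory=list)   # grounded 3D keypoints
    waypoints: List[Tuple[float, ...]] = field(default_factory=list)            # intermediate end-effector poses
    constraints: List[str] = field(default_factory=list)                        # hierarchical constraint representations


def learn_skill(human_video) -> Skill:
    """Skill learner: ground human-object movements from the video and
    distill them into keypoints, waypoints, and hierarchical constraints."""
    return Skill()  # stub


def adapt_skill(skill: Skill, scene_observation) -> Skill:
    """Skill adapter: transfer keypoints to the unseen scene and update the
    skill by iterative comparison."""
    return skill  # stub


def refine_skill(skill: Skill, scene_observation) -> Skill:
    """Skill refiner: optimize extracted/transferred interactions and refine
    pose estimates via iterative master-slave contact."""
    return skill  # stub


def execute(skill: Skill, robot) -> None:
    """Turn constraint-satisfying waypoints into robot motions."""
    pass  # stub


def fmimic(human_video, scene_observation, robot) -> None:
    skill = learn_skill(human_video)
    skill = adapt_skill(skill, scene_observation)
    skill = refine_skill(skill, scene_observation)
    execute(skill, robot)
```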

[Figure: overview of evaluated tasks.]
Subtasks: open drawer, stack block, open oven, pick mango to plate, press button, open microwave, put tray in oven, turn on oven, sweep table, insert box, brush pan, spread sauce, pick toy to drawer, pour from cup to cup, clean pan.
High-precision tasks: Double Square, Arch, Square+Circle, Star, Rectangle, Round, Pear, Hexagon, Oval.
Long-horizon tasks: chemistry experiments, clean table, make a pie, make coffee, make slices, wash pan.
Bimanual manipulation task: bimanual water pouring.