Teaching Robots from Human Egocentric Videos: HumanEgo

HumanEgo is a research implementation from the University of Maryland for learning robot manipulation policies from human first-person videos.

Official project page:
https://humanego-ai.github.io/

At first glance, using human videos for robot learning sounds very attractive. Humans already perform many useful manipulation tasks, and first-person videos are relatively easy to collect.

However, there is a major challenge.

Human hands and arms look completely different from robot arms.

For example, during training, the video may contain:

human hands
human arms
five fingers
skin

But during real robot inference, the camera may see:

robot arms
grippers
metal or plastic links
cables

If we train a model directly on human videos, the model may learn the appearance of human arms instead of the essence of the manipulation task.

HumanEgo tries to bridge this gap by converting human videos into a shared representation that is more suitable for robot learning.

Conceptually, the pipeline looks like this:

Remove the human arm from the image
  ↓
Draw a virtual gripper at the human hand position
  ↓
Represent the 3D relationship between the hand and objects as ICT
  ↓
Train a policy that generates future end-effector trajectories

In the rest of this article, a clean image means an image from which the human arm has been removed. HumanEgo can also draw a virtual gripper on top of that image to show where the hand is and how it interacts with the object.

ICT stands for Interaction-Centric Tokens. These are numerical tokens that represent hands and objects as 6DoF entities.

Goal of This Experiment

In this article, I am not trying to control a real robot right away.

Instead, I first want to run the public serve_bread example, a task in which a piece of bread (possibly a croissant) is served or moved.

The goal is to understand the full offline pipeline in three steps:

Preprocessing
Convert human videos into training data.
Training
Train a model that predicts future hand trajectories from images and ICT.
Offline evaluation
Compare predicted trajectories with ground-truth trajectories on held-out data.

For this experiment, I use the two official serve_bread recordings.

I use:

mps_serve_bread_000_vrs

as evaluation data, and:

mps_serve_bread_001_vrs

and any later recordings as training data. With only two recordings downloaded, 001 is the only training recording.

First, clone the repository and set up the environment:

git clone https://github.com/TX-Leo/HumanEgo.git
cd HumanEgo

conda create -n humanego python=3.11 -y
conda activate humanego

PREDOWNLOAD=1 bash setup.sh

Downloading the Sample Data

For this article, I want to inspect what happens inside preprocessing. So I download only the input data, without the precomputed preprocessing outputs.

python scripts/download_data.py \
  --task serve_bread \
  --num 2 \
  --input-only

The --input-only option skips the precomputed preprocess/ outputs. This allows us to run the preprocessing pipeline ourselves.

I download two recordings because the default HumanEgo training setup holds out the first recording for evaluation and uses the remaining recordings for training.
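The hold-out rule can be sketched in a few lines. This is a minimal illustration that assumes recordings are ordered by name and the first one is held out; `split_recordings` is a hypothetical helper, not a function from the HumanEgo repository, and the actual selection logic may differ.

```python
def split_recordings(names):
    """Hold out the first recording (by sorted name) for evaluation.

    Hypothetical helper for illustration; not part of the HumanEgo codebase.
    """
    ordered = sorted(names)
    return ordered[:1], ordered[1:]


eval_names, train_names = split_recordings(
    ["mps_serve_bread_001_vrs", "mps_serve_bread_000_vrs"]
)
print("eval :", eval_names)
print("train:", train_names)
```

With only two recordings, this leaves exactly one recording on each side of the split.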

What Does Preprocessing Do?

Now let’s preprocess the first recording.

mkdir -p blog_logs

/usr/bin/time -v python -m preprocess.Preprocess \
  --mps_path ./data/serve_bread/aria/mps_serve_bread_000_vrs \
  --task serve_bread \
  2>&1 | tee blog_logs/preprocess_000.log

Then run the same command for the second recording.

/usr/bin/time -v python -m preprocess.Preprocess \
  --mps_path ./data/serve_bread/aria/mps_serve_bread_001_vrs \
  --task serve_bread \
  2>&1 | tee blog_logs/preprocess_001.log

Preprocessing is not just about exporting RGB images.

It creates several types of data needed for training. The overall flow looks like this:

Extract Aria RGB, hand, and SLAM data
  ↓
Trim the manipulation segment
  ↓
Use Grounding DINO + SAM2 to create object and arm masks
  ↓
Select keypoints on the objects
  ↓
Track 2D keypoints with CoTracker
  ↓
Lift 2D tracks into 3D using SLAM camera poses
  ↓
Remove the human arm with LaMa
  ↓
Render a virtual gripper and object keypoints
  ↓
Merge everything into training_data.json

This preprocessing step is where HumanEgo converts raw human egocentric video into a representation that can later be used for robot policy learning.

Looking at the Images Generated by Preprocessing

After preprocessing, each frame has several images saved under:

preprocess/all_data/<idx>/

Examples include:

rgb.png
rgb_WoArm.png
rgb_WArmObjKpts.png
rgb_WoArm_WArmObjKpts.png

Let’s look at what each of these means.

1. `rgb.png`

This is the original RGB image extracted from the Aria recording. The human hand and arm are still visible.

2. `mask_arm.png`

This is the mask of the human hand and arm produced by Grounding DINO and SAM2. This mask defines the region that will later be removed by LaMa.

3. `mask_obj1.png` / `mask_obj2.png`

In the serve_bread task, the task configuration treats obj1 and obj2 as target objects. These object masks correspond to objects such as the bread and the plate.

One important point is that HumanEgo does not magically choose the task-relevant objects from every object in the image.

Instead, the preprocessing configuration specifies open-vocabulary detection prompts for each task.

For a custom task, you would define object prompts and hand prompts in a YAML file such as:

cfg/preprocess/tasks/<your_task>.yaml

4. `rgb_WoArm.png`

This is the image after the human arm has been inpainted by LaMa. The goal is to reduce the appearance gap between human demonstrations and robot execution.

5. `rgb_WoArm_WArmObjKpts.png`

This image removes the human arm and then draws a virtual gripper and object keypoints on top. The two-finger virtual gripper drawn in the image is a visual representation. It helps show where the hand is and what the gripper state looks like.

However, the ICT representation does not store the two fingertips directly.

Instead, the hand is represented as:

one 6DoF end-effector frame
+
a grasp value

What Does It Mean to Lift 2D Tracks into 3D?

One of the less intuitive parts of preprocessing is camtriangulator.

camtriangulator takes 2D tracks produced by CoTracker and uses SLAM camera poses to estimate 3D object poses.

Intuitively, it does something like this:

A 2D point in the image
Example: a point on the bread appears at u = u1, v = v1
  ↓
Track the same point across multiple frames
  ↓
Use SLAM to know the camera pose at each frame
  ↓
Intersect multiple viewing rays
  ↓
Recover a 3D point x, y, z

So HumanEgo is not reconstructing the entire image in 3D.

Instead, it estimates sparse 3D keypoints on the task-relevant objects.
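To make the ray-intersection idea concrete, here is a minimal NumPy sketch on synthetic data. It assumes a simple pinhole camera, which may not match the camera model HumanEgo actually uses, and it finds the 3D point closest to all viewing rays in the least-squares sense. The function names (`pixel_to_ray`, `triangulate_rays`, `project`) and all numbers are illustrative assumptions; this is not the camtriangulator implementation.

```python
import numpy as np


def pixel_to_ray(u, v, K, c2w):
    """Back-project pixel (u, v) into a world-space ray (origin, unit direction)."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = c2w[:3, :3] @ d_cam
    return c2w[:3, 3], d_world / np.linalg.norm(d_world)


def triangulate_rays(origins, directions):
    """Return the point minimizing the summed squared distance to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        P = np.eye(3) - np.outer(d, d)  # projects onto the plane normal to d
        A += P
        b += P @ o
    return np.linalg.solve(A, b)


def project(point_w, K, c2w):
    """Pinhole projection of a world point, used only to create synthetic tracks."""
    p_cam = np.linalg.inv(c2w) @ np.append(point_w, 1.0)
    uvw = K @ p_cam[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


# Synthetic intrinsics and a synthetic keypoint one meter in front of the camera.
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
point_true = np.array([0.10, -0.05, 1.00])

# Three camera poses that differ only by a sideways translation.
poses = []
for tx in (-0.2, 0.0, 0.2):
    c2w = np.eye(4)
    c2w[0, 3] = tx
    poses.append(c2w)

rays = [pixel_to_ray(*project(point_true, K, c2w), K, c2w) for c2w in poses]
point_est = triangulate_rays([o for o, _ in rays], [d for _, d in rays])
print("recovered point:", point_est)
```

The camera has to move between frames for this to work: if all rays were parallel, the linear system would be singular and depth could not be recovered.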

You can inspect the related output files like this:

MPS=./data/serve_bread/aria/mps_serve_bread_000_vrs

ls -lh \
  "$MPS/preprocess/cotracker_results.json" \
  "$MPS/preprocess/camtriangulator_results.json" \
  "$MPS/preprocess/camtriangulator_vis.ply" \
  "$MPS/preprocess/camtriangulator_vis.png"

Why is this 3D step necessary? Because a robot cannot act only from image pixels.

An image gives us:

u, v

But a robot needs targets like:

x, y, z
end-effector orientation
gripper open / close state

That is why HumanEgo builds a 3D relationship between the hand and objects, and later passes that relationship to the model as ICT.

What Is `training_data.json`?

One of the most important preprocessing outputs is:

training_data.json

This file is the source material for ICT.

Conceptually:

training_data.json
  hand 4x4 pose
  object 4x4 poses
  grasp state
  reference coordinate frame
  image paths
        ↓
training dataloader
        ↓
x_ict tensor

Let’s inspect one example.

python - <<'PY'
import json
from pathlib import Path

mps = Path("./data/serve_bread/aria/mps_serve_bread_000_vrs")
paths = sorted(mps.glob("preprocess/all_data/*/training_data.json"))

print("num training_data:", len(paths))
path = paths[0]
print("sample:", path)

with path.open() as f:
    d = json.load(f)

print("metadata keys:", d["metadata"].keys())
print("obs keys:", d["obs"].keys())

print("\nhands:")
for side, h in d["entities"]["hands"].items():
    print(side, "grasp=", h["grasp"])
    print("T_hand_to_world first row:", h["T_hand_to_world"][0])

print("\nobjects:")
for k, obj in d["entities"]["objects"].items():
    print(k)
    print("T_obj_to_world first row:", obj["T_obj_to_world"][0])
PY

Example output:

num training_data: 864
sample: data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
metadata keys: dict_keys(['idx', 'ts', 'w', 'h', 'fps', 'k', 'c2w', 'anchor_key', 'is_finished', 'world_transforms'])
obs keys: dict_keys(['mask_arm_path', 'mask_obj_path', 'rgb_path', 'rgb_WArmObjKpts_path', 'rgb_WoArm_path', 'rgb_WoArm_WArmObjKpts_path'])

hands:
right grasp= 0.0
T_hand_to_world first row: [-0.9652508471493321, 0.2561715273038293, 0.051621888729812965, -0.7862446231727486]

objects:
obj1
T_obj_to_world first row: [-0.498511026341544, 0.7874135934764195, -0.3625832172394254, -0.707926009796382]
obj2
T_obj_to_world first row: [-0.24740682137970874, -0.24193038536352468, -0.9382214841767175, -0.6432356533752537]

This tells us that each frame contains information about:

image paths
camera metadata
hand pose
object poses
grasp state

The dataloader later turns these values into tensors used by the model.
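Before trusting these poses, it can be useful to check that each 4x4 matrix is a valid rigid transform. The helper below is a small sanity-check sketch; `is_rigid_transform` is a hypothetical name and the tolerance is an arbitrary choice, not something taken from HumanEgo.

```python
import numpy as np


def is_rigid_transform(T, tol=1e-6):
    """Check that T is 4x4 with an orthonormal, right-handed rotation block."""
    T = np.asarray(T, dtype=float)
    if T.shape != (4, 4):
        return False
    R = T[:3, :3]
    return bool(
        np.allclose(R.T @ R, np.eye(3), atol=tol)
        and np.isclose(np.linalg.det(R), 1.0, atol=tol)
        and np.allclose(T[3], [0.0, 0.0, 0.0, 1.0], atol=tol)
    )


print(is_rigid_transform(np.eye(4)))        # identity is a valid pose
print(is_rigid_transform(2.0 * np.eye(4)))  # scaling breaks orthonormality
```

In the inspection script above, it could be applied as `is_rigid_transform(h["T_hand_to_world"])` for each hand and `is_rigid_transform(obj["T_obj_to_world"])` for each object. A looser tolerance may be needed for values stored in JSON.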

What Is ICT?

ICT stands for Interaction-Centric Tokens.

ICT is a numerical representation separate from the image. It converts each hand and object into a vector token.

HumanEgo describes ICT as a viewpoint- and embodiment-invariant encoding. In practice, it represents each hand or object as a 6DoF entity, and also represents the hand pose relative to that entity.

A simplified view looks like this:

[right hand token]
[bread token]
[plate token]
[padding]
[padding]
...

In the single-hand setting, each token has 20 dimensions:

[type_id]                  1 dimension
[pose_in_ref]              9 dimensions
  position xyz             3 dimensions
  rotation rot6d           6 dimensions
[hand_in_entity]           9 dimensions
[flag]                     1 dimension
-------------------------------
total                      20 dimensions

Let’s generate ICT using the dataloader and inspect it.

python - <<'PY'
from training.FlowMatchingDataloader import FlowMatchingDataloader, MPSSessions

mps = "./data/serve_bread/aria/mps_serve_bread_000_vrs"

ds = FlowMatchingDataloader(
    sessions=[MPSSessions(mps_path=mps)],
    single_hand=True,
    single_hand_side="right",
    max_ict=8,
    img_name="rgb_WoArm_WArmObjKpts.png",
    use_legacy_image_loading=False,
    enable_augmentation=False,
)

sample = ds[0]

print("json_path:", sample["json_path"])
print("x_rgb shape:", tuple(sample["x_rgb"].shape))
print("x_ict shape:", tuple(sample["x_ict"].shape))
print("ict_mask:", sample["ict_mask"].tolist())
print("y_action shape:", tuple(sample["y_action"].shape))

for i, valid in enumerate(sample["ict_mask"]):
    if not bool(valid):
        continue

    tok = sample["x_ict"][i]
    print(f"\nToken {i}")
    print("  type_id              :", tok[0].item())
    print("  pose_in_ref xyz      :", tok[1:4].tolist())
    print("  pose_in_ref rot6d    :", tok[4:10].tolist())
    print("  hand_in_this xyz     :", tok[10:13].tolist())
    print("  hand_in_this rot6d   :", tok[13:19].tolist())
    print("  flag                 :", tok[19].item())
PY

Example output:

[FlowMatchingDataloader] Built 864 samples.
  - Hand Method  : aria_mps (key='hands')
  - Image Input  : rgb_WoArm_WArmObjKpts.png
  - Frame Mode   : ANCHOR_FRAME
  - Action Mode  : ABSOLUTE
  - Aux  : ObjDynamics=True | VisForesight=True | TempContrastive=True
  - 3D PCD Feats : True
json_path: ./data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
x_rgb shape: (3, 240, 320)
x_ict shape: (8, 20)
ict_mask: [True, True, True, False, False, False, False, False]
y_action shape: (50, 10)

Token 0
  type_id              : 2.0
  pose_in_ref xyz      : [0.16784971952438354, -0.02881561778485775, -0.07735089212656021]
  pose_in_ref rot6d    : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512915223836899]
  hand_in_this xyz     : [1.1102230246251565e-16, 3.8163916471489756e-17, 0.0]
  hand_in_this rot6d   : [1.0, -5.938278001447962e-17, 5.938278001447962e-17, 1.0, 6.2144218264730725e-19, 2.566785074529357e-18]
  flag                 : 0.0

Token 1
  type_id              : 3.0
  pose_in_ref xyz      : [-2.2069588823114827e-08, 7.90906717895723e-09, 2.1668267180530165e-08]
  pose_in_ref rot6d    : [1.0, 3.951016847025812e-09, -3.951016847025812e-09, 1.0, 1.3024523681792743e-08, -8.553106667363863e-09]
  hand_in_this xyz     : [0.16784973442554474, -0.028815625235438347, -0.0773509219288826]
  hand_in_this rot6d   : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512916341423988]
  flag                 : -1.0

Token 2
  type_id              : 4.0
  pose_in_ref xyz      : [-0.24469204246997833, -0.06749166548252106, 0.01143864169716835]
  pose_in_ref rot6d    : [0.962512731552124, 0.03149929270148277, 0.2607153654098511, 0.1665089875459671, -0.07481195032596588, 0.9855366945266724]
  hand_in_this xyz     : [0.4138026237487793, -0.0680706650018692, 0.060859549790620804]
  hand_in_this rot6d   : [-0.00673748878762126, -0.9991859197616577, 0.18744736909866333, -0.040333494544029236, 0.9822515845298767, 0.0008433718467131257]
  flag                 : -1.0

One thing to note here: x_ict is not a file name. It is the name of a PyTorch tensor passed into the model.
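The printed tokens hint at how the two 9-dimensional pose fields are built. Token 1 has an identity pose_in_ref, and its hand_in_this matches the hand token's pose_in_ref, which is what you would expect if that object defines the reference frame and each field is a relative rigid transform. The identity rotation printing as [1, 0, 0, 1, 0, 0] also suggests that rot6d is the first two columns of the rotation matrix, flattened row by row. The sketch below reproduces that pattern with NumPy on made-up poses. It is inferred from the output only: `pose_to_9d`, `relative_pose`, and `make_token` are hypothetical names, the type ids and flags are simply copied from the printout, and the real dataloader may differ.

```python
import numpy as np


def rot_z(deg):
    """Rotation about the z axis, used only to build example poses."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])


def make_pose(R, xyz):
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = xyz
    return T


def pose_to_9d(T):
    """xyz followed by the first two rotation columns, flattened row by row (assumed layout)."""
    T = np.asarray(T, dtype=float)
    return np.concatenate([T[:3, 3], T[:3, :2].reshape(-1)])


def relative_pose(T_a_to_world, T_b_to_world):
    """Pose of frame b expressed in frame a."""
    return np.linalg.inv(T_a_to_world) @ T_b_to_world


def make_token(type_id, T_entity, T_ref, T_hand, flag):
    """Assemble [type_id | pose_in_ref | hand_in_entity | flag] as a 20-dim vector."""
    return np.concatenate([
        [type_id],
        pose_to_9d(relative_pose(T_ref, T_entity)),
        pose_to_9d(relative_pose(T_entity, T_hand)),
        [flag],
    ])


# Made-up world poses for one object (used as the reference frame) and one hand.
T_obj = make_pose(rot_z(30.0), [0.5, 0.1, 0.0])
T_hand = make_pose(rot_z(-45.0), [0.6, 0.0, 0.2])

hand_tok = make_token(2.0, T_entity=T_hand, T_ref=T_obj, T_hand=T_hand, flag=0.0)
obj_tok = make_token(3.0, T_entity=T_obj, T_ref=T_obj, T_hand=T_hand, flag=-1.0)

print("hand pose_in_ref   :", hand_tok[1:10])
print("obj  hand_in_entity:", obj_tok[10:19])
```

As in the real output, the reference object's pose_in_ref is the identity, the hand's pose relative to itself is the identity, and the object's hand_in_entity equals the hand's pose_in_ref.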

Predicting a Future 50-Step Trajectory

During training, HumanEgo learns to predict a short future 6DoF hand motion from a first-person image and ICT.

In the default configuration, pred_horizon is 50.

This means the model predicts 50 future steps. If the Aria RGB stream is 30 FPS, this roughly corresponds to:

50 frames / 30 fps ≈ 1.67 seconds

However, the actual time span can change if you modify the stride or the control frequency during real robot inference.
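As a quick check of that arithmetic, the helper below computes the time span covered by a prediction. It assumes that a stride of n means each predicted step spans n video frames; `horizon_seconds` is a hypothetical helper and the stride semantics in HumanEgo may be defined differently.

```python
def horizon_seconds(pred_horizon, fps, stride=1):
    """Time covered by a prediction, assuming each step spans `stride` frames."""
    return pred_horizon * stride / fps


print(round(horizon_seconds(50, 30), 2))            # 1.67
print(round(horizon_seconds(50, 30, stride=2), 2))  # 3.33
```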

The model does not predict the detailed positions of the two virtual gripper fingers.

Instead, for each future step, it predicts:

end-effector position
end-effector orientation
grasp value

In the single-hand setting, each step has 10 dimensions:

[x, y, z, r1, r2, r3, r4, r5, r6, grasp]

So a 50-step trajectory becomes:

50 × 10
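A short sketch of how such a trajectory could be unpacked is shown below. It splits each step into position, rot6d, and grasp, and rebuilds a full rotation matrix from rot6d with Gram-Schmidt orthonormalization. The rot6d layout (first two rotation columns, flattened row by row) is an assumption based on the identity value [1, 0, 0, 1, 0, 0] seen in the ICT printout; `split_action` and `rot6d_to_matrix` are hypothetical names, and the trajectory here is synthetic.

```python
import numpy as np


def split_action(y_action):
    """Split a (steps, 10) trajectory into xyz, rot6d, and grasp parts."""
    y = np.asarray(y_action, dtype=float)
    return y[:, 0:3], y[:, 3:9], y[:, 9]


def rot6d_to_matrix(r6):
    """Recover a 3x3 rotation from rot6d via Gram-Schmidt (assumed layout)."""
    m = np.asarray(r6, dtype=float).reshape(3, 2)
    a1, a2 = m[:, 0], m[:, 1]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1  # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)          # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)


# Synthetic 50-step trajectory: zero position, identity rotation, open gripper.
y_action = np.zeros((50, 10))
y_action[:, 3:9] = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

xyz, rot6d, grasp = split_action(y_action)
R0 = rot6d_to_matrix(rot6d[0])
print(xyz.shape, rot6d.shape, grasp.shape)
print(R0)
```

The Gram-Schmidt step matters because a network output is not guaranteed to be exactly orthonormal, so the rotation has to be re-normalized before it is sent to a controller.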

Training

Training can be started with:

python -m training.FlowMatchingTrainer \
  --task serve_bread \
  --use_cfg \
  --job HumanEgo \
  2>&1 | tee blog_logs/train_400.log

In the default configuration:

epochs       = 400
batch_size   = 32
pred_horizon = 50

At a high level, training does this:

current image
+
current ICT
  ↓
learn to generate the future 50-step end-effector trajectory

It is important to clarify what is being trained here.

This training step does not retrain SAM2, Grounding DINO, LaMa, or CoTracker.

Those models are used during preprocessing.

The model trained by HumanEgo is the policy that uses:

processed images
hand and object poses
training_data.json
ICT
future hand trajectories

to generate a future end-effector trajectory from the current observation.

In other words, HumanEgo trains a trajectory generation policy. This policy tells the robot where the end-effector should move next.

HumanEgo uses Flow Matching as the learning method for this trajectory generation policy.

What Is Flow Matching?

Flow Matching is a general generative modeling method.

It learns a vector field that moves samples from a noise distribution toward the data distribution.

In HumanEgo, this idea is applied to future robot trajectory generation.

During training, we first have the ground-truth future trajectory from the human demonstration:

x1 = ground-truth future 50-step trajectory

Then we sample random noise with the same shape:

x0 = random trajectory

Next, we create an intermediate trajectory by interpolating between the two at a random time t between 0 and 1:

xt = (1 - t) x0 + t x1

The model is conditioned on the current image and ICT, and learns:

Given this intermediate trajectory xt,
which direction should it move to become closer to the ground-truth trajectory x1?

Intuitively, the model repeatedly practices the following task:

random future trajectory
  ↓
predict how to move it toward a human-demonstration-like trajectory
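The following NumPy sketch shows the standard straight-path form of this idea: how a training pair is built from xt = (1 - t) x0 + t x1, and how a velocity field is integrated from noise to a trajectory at inference time. It assumes Gaussian noise, a velocity target of x1 - x0, and plain Euler integration; these are common choices, not confirmed details of HumanEgo. An oracle that already knows x1 stands in for the trained network, and all names are illustrative.

```python
import numpy as np


def make_training_pair(x1, rng):
    """Build one flow matching training example from a ground-truth trajectory."""
    x0 = rng.standard_normal(x1.shape)  # noise with the same shape as x1
    t = rng.uniform(0.0, 1.0)           # random interpolation time
    xt = (1.0 - t) * x0 + t * x1        # intermediate trajectory
    target_velocity = x1 - x0           # derivative of xt with respect to t
    return xt, t, target_velocity


def euler_sample(velocity_fn, shape, rng, num_steps=10):
    """Integrate a velocity field from noise (t=0) to a trajectory (t=1)."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=(50, 10))  # stand-in for a 50-step, 10-dim trajectory

xt, t, target_velocity = make_training_pair(x1, rng)


def oracle_velocity(x, t):
    """Stand-in for the trained model: points straight at the known x1."""
    return (x1 - x) / (1.0 - t)


sample = euler_sample(oracle_velocity, x1.shape, rng)
print("max abs error:", np.abs(sample - x1).max())
```

In real training, the network would be regressed onto `target_velocity` given xt, t, the image, and ICT. At inference there is no x1, so the network's predicted velocity replaces the oracle.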

Treating action sequences as generative modeling targets has become increasingly common in robot learning.

The unique part of HumanEgo is not just Flow Matching itself.

The key idea is how HumanEgo converts human first-person videos into:

arm-removed images
virtual gripper images
ICT
future end-effector trajectories

and then uses them for robot policy learning.

What Is Produced After Training?

After training, we do not immediately run the policy on a real robot.

The first thing to inspect is the offline evaluation.

Offline evaluation uses a held-out recording. In the standard public sample setup, recording 000 is held out for evaluation, while the remaining recordings are used for training.

Conceptually, offline evaluation works like this:

current frame from evaluation data
  ├── processed image
  └── ICT
        ↓
trained Flow Matching policy
        ↓
predicted future 50-step end-effector trajectory
        ↓
compare with the ground-truth future trajectory

The important point is that this is not a real robot success rate.

In offline evaluation, the image, ICT, and ground-truth trajectory are already available in the evaluation dataset.

This does not include many real-world robot problems, such as:

the robot pushing the object accidentally
the gripper failing to close correctly
camera-robot calibration error
object detection failure in a real environment
robot control latency
contact and slipping

So offline evaluation only measures how well the learned policy predicts trajectories on held-out data.
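To make "comparing trajectories" concrete, here is a minimal sketch of two common per-step errors: Euclidean position error in meters and geodesic rotation error in degrees. The metric names reported later in this article (pos_err_*_m, rot_err_*_deg) suggest the same units, but the exact definitions HumanEgo uses are not confirmed here, and the function names are hypothetical.

```python
import numpy as np


def position_error_m(pred_xyz, gt_xyz):
    """Euclidean distance between predicted and ground-truth positions."""
    diff = np.asarray(pred_xyz, dtype=float) - np.asarray(gt_xyz, dtype=float)
    return float(np.linalg.norm(diff))


def rotation_error_deg(R_pred, R_gt):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_pred, dtype=float).T @ np.asarray(R_gt, dtype=float)
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def rot_z(deg):
    """Rotation about the z axis, used only to build a test case."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])


pos_err = position_error_m([0.03, 0.04, 0.0], [0.0, 0.0, 0.0])
rot_err = rotation_error_deg(rot_z(0.0), rot_z(10.0))
print(f"position error: {pos_err:.3f} m")
print(f"rotation error: {rot_err:.1f} deg")
```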

Understanding the Output Files

After training, results are saved under:

runs/serve_bread/HumanEgo/

The run directory contains files such as:

latest.pt
dataset_stats.json
config.json
train_curve.png
eval_curve.png
eval_snapshots/eval_ep_*.json
eval_render/epoch_*/

If we just look at this as a file list, it can be hard to understand what each file means.

A more useful way is to group them by role.

A. Trained policy weights
   latest.pt

B. Information needed to reproduce or use the policy
   config.json
   dataset_stats.json

C. Information for checking training progress
   train_curve.png

D. Numerical results from offline evaluation
   eval_curve.png
   eval_snapshots/eval_ep_*.json

E. Visual results from offline evaluation
   eval_render/epoch_*/

The relationship between these files is:

training_data.json
        ↓
training
        ↓
latest.pt is created
        ↓
apply latest.pt to evaluation data
        ↓
compare predicted trajectories with ground-truth trajectories
        ↓
eval_curve.png / eval_snapshots / eval_render are created

So eval_curve.png and eval_render/ are not just extra files.

They are the outputs that help us check how well latest.pt predicts future trajectories on evaluation data.

Inference

The trained latest.pt can be used as policy weights for inference.

However, there are two different meanings of inference here.

The first one is offline inference.

existing evaluation data
  ↓
generate future trajectories
  ↓
compare them with ground truth

The second one is real robot inference.

real camera
+
real robot
+
perception
+
calibration
+
controller
  ↓
generate future end-effector trajectory
  ↓
move the robot

This article focuses on the first one: offline inference.

The offline inference flow is:

evaluation training_data.json
evaluation image
        ↓
dataloader generates ICT
        ↓
policy loaded from latest.pt
        ↓
generate future 50-step end-effector trajectory
        ↓
compare with ground truth

The results appear as:

eval_curve.png
eval_snapshots/
eval_render/

How to Read Offline Evaluation Results

When checking offline evaluation, I usually look at the outputs in this order:

1. train_curve.png
      Did training itself proceed normally?

2. eval_curve.png
      Did the evaluation error decrease?

3. eval_snapshots/eval_ep_*.json
      What are the final quantitative metrics?

4. eval_render/epoch_*/
      Do the predicted trajectories look natural?

[Figure: train_curve.png and eval_curve.png from this run]

For example, the final evaluation video may be located at:

runs/serve_bread/HumanEgo/eval_render/epoch_0400/evaluation_vis.mp4

https://youtube.com/shorts/cVYoChnYpHg?feature=share

The main metrics from:

runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json

look like this:

num snapshots: 400
latest: runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json
epoch: 400
frames: 864
done_acc: 0.9085648148148148
pos_err_k1_m: 0.03651409354750757
pos_err_kK_m: 0.037472219930754766
rot_err_k1_deg: 5.2715283058307785
rot_err_kK_deg: 7.541413042280409
pos_err_w_m: 0.037069905003556954
rot_err_w_deg: 6.384099839530582
grasp_f1_k1: 0.8659217877094972
grasp_f1_kK: 0.8555478018143754
grasp_f1_w: 0.8580602059355177
zero_state_ratio: 0.0
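A summary like the one above can be produced with a short script along these lines. This is a sketch that assumes each snapshot is a JSON object with scalar metrics stored at the top level; the actual file layout may differ, and `load_latest_snapshot` is a hypothetical helper rather than part of HumanEgo.

```python
import json
from pathlib import Path


def load_latest_snapshot(run_dir):
    """Return (num_snapshots, path, data) for the newest eval snapshot, or None."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        return None
    paths = sorted(run_dir.glob("eval_snapshots/eval_ep_*.json"))
    if not paths:
        return None
    latest = paths[-1]  # zero-padded epoch numbers sort correctly as strings
    with latest.open() as f:
        return len(paths), latest, json.load(f)


result = load_latest_snapshot("runs/serve_bread/HumanEgo")
if result is not None:
    count, latest, data = result
    print("num snapshots:", count)
    print("latest:", latest)
    if isinstance(data, dict):
        for key, value in data.items():
            # Assumption: metrics are top-level scalar fields.
            if isinstance(value, (int, float)):
                print(f"{key}: {value}")
```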

Again, we need to be careful here.

Offline evaluation is not the same as real robot success rate.

Offline evaluation checks:

current image + current ICT
        ↓
predicted future trajectory
        ↓
comparison with ground-truth future trajectory

But in a real robot setup, many additional factors appear:

object pose estimation
camera-robot calibration
robot control
contact dynamics
slipping
grasp failure
latency

So even if offline evaluation looks good, it does not guarantee real-world robot success.

However, if offline evaluation is bad, it is a strong signal that we should check the training data, preprocessing results, ICT, and configuration before moving to a real robot.

What I Learned from This Experiment

After running HumanEgo on the serve_bread sample, the overall structure became much clearer to me.

Here is the big picture:

preprocess
  Converts human videos into a format suitable for robot learning

training_data.json
  Stores hand poses, object poses, camera poses, grasp states, and image paths

ICT
  Numerical tokens generated from training_data.json during training and inference

training
  Learns a policy that generates future end-effector trajectories from current images and ICT

latest.pt
  Trained policy weights

inference
  Generates future end-effector trajectories from clean images and ICT,
  then passes them to a robot controller

The most important takeaway is that HumanEgo is not simply “training on human videos.”

It first transforms human egocentric demonstrations into a robot-friendly representation:

remove human appearance
represent interaction in 3D
encode hands and objects as ICT
learn future end-effector trajectories

This makes it easier to understand why preprocessing is so central to the method.

For me, the serve_bread example was useful because it showed the entire pipeline end to end:

raw human video
  ↓
preprocessing outputs
  ↓
training_data.json
  ↓
ICT
  ↓
Flow Matching policy training
  ↓
offline trajectory evaluation

Before trying real robot execution, I think this offline pipeline is the right place to start.

It helps answer a very practical question:

Can the model predict reasonable future trajectories from the processed human demonstration data?

Once that part looks reliable, the next challenge is connecting the learned policy to real robot perception, calibration, and control.
