Teaching Robots from Human Egocentric Videos: A Hands-On Look at HumanEgo
Running the serve_bread example to understand how HumanEgo transforms human first-person videos into robot manipulation trajectories.

HumanEgo is a research implementation from the University of Maryland for learning robot manipulation policies from human first-person videos.
Official project page:
https://humanego-ai.github.io/
At first glance, using human videos for robot learning sounds very attractive. Humans already perform many useful manipulation tasks, and first-person videos are relatively easy to collect.
However, there is a major challenge.
Human hands and arms look completely different from robot arms.
For example, during training, the video may contain:
human hands
human arms
five fingers
skin
But during real robot inference, the camera may see:
robot arms
grippers
metal or plastic links
cables
If we train a model directly on human videos, the model may learn the appearance of human arms instead of the essence of the manipulation task.
HumanEgo tries to bridge this gap by converting human videos into a shared representation that is more suitable for robot learning.
Conceptually, the pipeline looks like this:
Remove the human arm from the image
↓
Draw a virtual gripper at the human hand position
↓
Represent the 3D relationship between the hand and objects as ICT
↓
Train a policy that generates future end-effector trajectories
In the rest of this article, a clean image refers to an image from which the human arm has been removed. HumanEgo can also draw a virtual gripper on top of the image to show where the hand is and how it is interacting with the object.
ICT stands for Interaction-Centric Tokens. These are numerical tokens that represent hands and objects as 6DoF entities.
Goal of This Experiment
In this article, I am not trying to control a real robot right away.
Instead, I first want to run the public serve_bread example, a task in which a piece of bread (possibly a croissant) is served or moved.
The goal is to understand the full offline pipeline in three steps:
Preprocessing
Convert human videos into training data.
Training
Train a model that predicts future hand trajectories from images and ICT.
Offline evaluation
Compare predicted trajectories with ground-truth trajectories on held-out data.
For this experiment, I use the two official serve_bread recordings.
I use:
mps_serve_bread_000_vrs
as evaluation data, and:
mps_serve_bread_001_vrs
and any later recordings as training data. With only two recordings downloaded, 001 is the sole training recording.
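The hold-out rule above can be sketched in a few lines. This is only an illustration that assumes recordings are ordered by name; split_recordings is a hypothetical helper, not a function from the HumanEgo repository.

```python
def split_recordings(names):
    """Hold out the first recording (by sorted name) for evaluation."""
    ordered = sorted(names)
    if len(ordered) < 2:
        raise ValueError("need at least two recordings to form a split")
    # First recording is held out; everything after it is used for training.
    return ordered[:1], ordered[1:]


eval_recs, train_recs = split_recordings(
    ["mps_serve_bread_001_vrs", "mps_serve_bread_000_vrs"]
)
print("eval :", eval_recs)
print("train:", train_recs)
```

With two recordings, this leaves exactly one recording on each side of the split, which is why downloading fewer than two would not work for this experiment.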
First, clone the repository and set up the environment:
git clone https://github.com/TX-Leo/HumanEgo.git
cd HumanEgo
conda create -n humanego python=3.11 -y
conda activate humanego
PREDOWNLOAD=1 bash setup.sh
Downloading the Sample Data
For this article, I want to inspect what happens inside preprocessing. So I download only the input data, without the precomputed preprocessing outputs.
python scripts/download_data.py \
--task serve_bread \
--num 2 \
--input-only
The --input-only option skips the precomputed preprocess/ outputs. This allows us to run the preprocessing pipeline ourselves.
I download two recordings because the default HumanEgo training setup holds out the first recording for evaluation and uses the remaining recordings for training.
What Does Preprocessing Do?
Now let’s preprocess the first recording.
mkdir -p blog_logs
/usr/bin/time -v python -m preprocess.Preprocess \
--mps_path ./data/serve_bread/aria/mps_serve_bread_000_vrs \
--task serve_bread \
2>&1 | tee blog_logs/preprocess_000.log
Then run the same command for the second recording.
/usr/bin/time -v python -m preprocess.Preprocess \
--mps_path ./data/serve_bread/aria/mps_serve_bread_001_vrs \
--task serve_bread \
2>&1 | tee blog_logs/preprocess_001.log
Preprocessing is not just about exporting RGB images.
It creates several types of data needed for training. The overall flow looks like this:
Extract Aria RGB, hand, and SLAM data
↓
Trim the manipulation segment
↓
Use Grounding DINO + SAM2 to create object and arm masks
↓
Select keypoints on the objects
↓
Track 2D keypoints with CoTracker
↓
Lift 2D tracks into 3D using SLAM camera poses
↓
Remove the human arm with LaMa
↓
Render a virtual gripper and object keypoints
↓
Merge everything into training_data.json
This preprocessing step is where HumanEgo converts raw human egocentric video into a representation that can later be used for robot policy learning.
Looking at the Images Generated by Preprocessing
After preprocessing, each frame has several images saved under:
preprocess/all_data/<idx>/
Examples include:
rgb.png
rgb_WoArm.png
rgb_WArmObjKpts.png
rgb_WoArm_WArmObjKpts.png
Let’s look at what each of these means.
1. rgb.png
This is the original RGB image extracted from the Aria recording. The human hand and arm are still visible.
2. mask_arm.png
This is the mask of the human hand and arm extracted by Grounding DINO and SAM2. This mask defines the region that will later be removed by LaMa.
3. mask_obj1.png / mask_obj2.png
In the serve_bread task, the task configuration treats obj1 and obj2 as target objects. These object masks correspond to objects such as the bread and the plate.
One important point is that HumanEgo does not magically choose the task-relevant objects from every object in the image.
Instead, the preprocessing configuration specifies open-vocabulary detection prompts for each task.
For a custom task, you would define object prompts and hand prompts in a YAML file such as:
cfg/preprocess/tasks/<your_task>.yaml
4. rgb_WoArm.png
This is the image after the human arm has been inpainted by LaMa. The goal is to reduce the appearance gap between human demonstrations and robot execution.
5. rgb_WoArm_WArmObjKpts.png
This image removes the human arm and then draws a virtual gripper and object keypoints on top. The two-finger virtual gripper drawn in the image is a visual representation. It helps show where the hand is and what the gripper state looks like.
However, the ICT representation does not store the two fingertips directly.
Instead, the hand is represented as:
one 6DoF end-effector frame
+
a grasp value
What Does It Mean to Lift 2D Tracks into 3D?
One of the less intuitive parts of preprocessing is camtriangulator.
camtriangulator takes 2D tracks produced by CoTracker and uses SLAM camera poses to estimate 3D object poses.
Intuitively, it does something like this:
A 2D point in the image
Example: a point on the bread appears at u = u1, v = v1
↓
Track the same point across multiple frames
↓
Use SLAM to know the camera pose at each frame
↓
Intersect multiple viewing rays
↓
Recover a 3D point x, y, z
So HumanEgo is not reconstructing the entire image in 3D.
Instead, it estimates sparse 3D keypoints on the task-relevant objects.
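The ray-intersection idea can be sketched with NumPy. This is a generic least-squares triangulation under an idealised pinhole camera model, not the camtriangulator implementation; pixel_to_ray and triangulate are hypothetical names, and the intrinsics K and camera poses below are made-up test values.

```python
import numpy as np


def pixel_to_ray(u, v, K, T_cam_to_world):
    """Back-project pixel (u, v) into a world-space ray (origin, unit direction)."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = T_cam_to_world[:3, :3] @ d_cam
    return T_cam_to_world[:3, 3], d_world / np.linalg.norm(d_world)


def triangulate(rays):
    """Return the point with the smallest summed squared distance to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for origin, direction in rays:
        # P projects onto the plane orthogonal to the ray direction.
        P = np.eye(3) - np.outer(direction, direction)
        A += P
        b += P @ origin
    return np.linalg.solve(A, b)


# Synthetic check: observe a known 3D point from two translated cameras.
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
point = np.array([0.1, -0.05, 1.0])
rays = []
for tx in (0.0, 0.2):
    T = np.eye(4)          # camera axes aligned with the world axes
    T[0, 3] = tx           # camera shifted along x
    u, v, w = K @ (point - T[:3, 3])
    rays.append(pixel_to_ray(u / w, v / w, K, T))

recovered = triangulate(rays)
print("recovered point:", recovered)
```

At least two rays from different camera positions are needed; with parallel rays the linear system becomes singular, which is one reason camera motion matters for this step.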
You can inspect the related output files like this:
MPS=./data/serve_bread/aria/mps_serve_bread_000_vrs
ls -lh \
"$MPS/preprocess/cotracker_results.json" \
"$MPS/preprocess/camtriangulator_results.json" \
"$MPS/preprocess/camtriangulator_vis.ply" \
"$MPS/preprocess/camtriangulator_vis.png"
Why is this 3D step necessary? Because a pixel location alone does not tell a robot where to move its end-effector in space.
An image gives us:
u, v
But a robot needs targets like:
x, y, z
end-effector orientation
gripper open / close state
That is why HumanEgo builds a 3D relationship between the hand and objects, and later passes that relationship to the model as ICT.
What Is training_data.json?
One of the most important preprocessing outputs is:
training_data.json
This file is the source material for ICT.
Conceptually:
training_data.json
hand 4x4 pose
object 4x4 poses
grasp state
reference coordinate frame
image paths
↓
training dataloader
↓
x_ict tensor
Let’s inspect one example.
python - <<'PY'
import json
from pathlib import Path
# Recording to inspect (the held-out evaluation recording).
mps = Path("./data/serve_bread/aria/mps_serve_bread_000_vrs")
paths = sorted(mps.glob("preprocess/all_data/*/training_data.json"))
print("num training_data:", len(paths))
# Load the first per-frame file.
path = paths[0]
print("sample:", path)
with path.open() as f:
    d = json.load(f)
print("metadata keys:", d["metadata"].keys())
print("obs keys:", d["obs"].keys())
# Each hand stores a grasp value and a 4x4 hand-to-world pose.
print("\nhands:")
for side, h in d["entities"]["hands"].items():
    print(side, "grasp=", h["grasp"])
    print("T_hand_to_world first row:", h["T_hand_to_world"][0])
# Each object stores a 4x4 object-to-world pose.
print("\nobjects:")
for k, obj in d["entities"]["objects"].items():
    print(k)
    print("T_obj_to_world first row:", obj["T_obj_to_world"][0])
PY
Example output:
num training_data: 864
sample: data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
metadata keys: dict_keys(['idx', 'ts', 'w', 'h', 'fps', 'k', 'c2w', 'anchor_key', 'is_finished', 'world_transforms'])
obs keys: dict_keys(['mask_arm_path', 'mask_obj_path', 'rgb_path', 'rgb_WArmObjKpts_path', 'rgb_WoArm_path', 'rgb_WoArm_WArmObjKpts_path'])
hands:
right grasp= 0.0
T_hand_to_world first row: [-0.9652508471493321, 0.2561715273038293, 0.051621888729812965, -0.7862446231727486]
objects:
obj1
T_obj_to_world first row: [-0.498511026341544, 0.7874135934764195, -0.3625832172394254, -0.707926009796382]
obj2
T_obj_to_world first row: [-0.24740682137970874, -0.24193038536352468, -0.9382214841767175, -0.6432356533752537]
This tells us that each frame contains information about:
image paths
camera metadata
hand pose
object poses
grasp state
The dataloader later turns these values into tensors used by the model.
What Is ICT?
ICT stands for Interaction-Centric Tokens.
ICT is a numerical representation separate from the image. It converts each hand and object into a vector token.
HumanEgo describes ICT as a viewpoint- and embodiment-invariant encoding. In practice, it represents each hand or object as a 6DoF entity, and also represents the hand pose relative to that entity.
A simplified view looks like this:
[right hand token]
[bread token]
[plate token]
[padding]
[padding]
...
In the single-hand setting, each token is roughly 20-dimensional.
[type_id] 1 dimension
[pose_in_ref] 9 dimensions
position xyz 3 dimensions
rotation rot6d 6 dimensions
[hand_in_entity] 9 dimensions
[flag] 1 dimension
-------------------------------
total 20 dimensions
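To make the layout concrete, here is a small NumPy sketch that assembles a 20-dimensional token from 4x4 poses. It is not HumanEgo's dataloader code: make_token, pose9, and rot6d are hypothetical helpers, the rot6d ordering (first two rotation columns, flattened row by row) is inferred from the example output in this article, and the type_id and flag values are copied from that output without knowing their exact meaning.

```python
import numpy as np


def rot6d(R):
    """First two columns of a 3x3 rotation, flattened row by row.

    With this ordering the identity rotation becomes [1, 0, 0, 1, 0, 0].
    """
    return R[:, :2].reshape(-1)


def pose9(T):
    """Encode a 4x4 pose as xyz (3 values) followed by rot6d (6 values)."""
    return np.concatenate([T[:3, 3], rot6d(T[:3, :3])])


def make_token(type_id, T_entity_in_ref, T_hand_in_ref, flag):
    """Assemble [type_id | pose_in_ref | hand_in_entity | flag] = 20 values."""
    T_hand_in_entity = np.linalg.inv(T_entity_in_ref) @ T_hand_in_ref
    return np.concatenate(
        [[type_id], pose9(T_entity_in_ref), pose9(T_hand_in_entity), [flag]]
    )


# Toy scene: the object sits at the reference frame, the hand is 10 cm away on x.
T_obj = np.eye(4)
T_hand = np.eye(4)
T_hand[0, 3] = 0.1

hand_token = make_token(2.0, T_hand, T_hand, 0.0)   # the hand relative to itself
obj_token = make_token(3.0, T_obj, T_hand, -1.0)    # the hand relative to the object
print(hand_token.shape, obj_token.shape)
print("hand_in_entity for the object token:", obj_token[10:19])
```

In this toy scene the hand token's hand_in_entity block is the identity pose, while the object token's hand_in_entity block carries the hand pose expressed in the object frame.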
Let’s generate ICT using the dataloader and inspect it.
python - <<'PY'
from training.FlowMatchingDataloader import FlowMatchingDataloader, MPSSessions
mps = "./data/serve_bread/aria/mps_serve_bread_000_vrs"
# Build a dataset over one recording, single right hand, no augmentation.
ds = FlowMatchingDataloader(
    sessions=[MPSSessions(mps_path=mps)],
    single_hand=True,
    single_hand_side="right",
    max_ict=8,
    img_name="rgb_WoArm_WArmObjKpts.png",
    use_legacy_image_loading=False,
    enable_augmentation=False,
)
sample = ds[0]
print("json_path:", sample["json_path"])
print("x_rgb shape:", tuple(sample["x_rgb"].shape))
print("x_ict shape:", tuple(sample["x_ict"].shape))
print("ict_mask:", sample["ict_mask"].tolist())
print("y_action shape:", tuple(sample["y_action"].shape))
# Print only the valid (non-padding) tokens, field by field.
for i, valid in enumerate(sample["ict_mask"]):
    if not bool(valid):
        continue
    tok = sample["x_ict"][i]
    print(f"\nToken {i}")
    print(" type_id :", tok[0].item())
    print(" pose_in_ref xyz :", tok[1:4].tolist())
    print(" pose_in_ref rot6d :", tok[4:10].tolist())
    print(" hand_in_this xyz :", tok[10:13].tolist())
    print(" hand_in_this rot6d :", tok[13:19].tolist())
    print(" flag :", tok[19].item())
PY
Example output:
[FlowMatchingDataloader] Built 864 samples.
- Hand Method : aria_mps (key='hands')
- Image Input : rgb_WoArm_WArmObjKpts.png
- Frame Mode : ANCHOR_FRAME
- Action Mode : ABSOLUTE
- Aux : ObjDynamics=True | VisForesight=True | TempContrastive=True
- 3D PCD Feats : True
json_path: ./data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
x_rgb shape: (3, 240, 320)
x_ict shape: (8, 20)
ict_mask: [True, True, True, False, False, False, False, False]
y_action shape: (50, 10)
Token 0
type_id : 2.0
pose_in_ref xyz : [0.16784971952438354, -0.02881561778485775, -0.07735089212656021]
pose_in_ref rot6d : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512915223836899]
hand_in_this xyz : [1.1102230246251565e-16, 3.8163916471489756e-17, 0.0]
hand_in_this rot6d : [1.0, -5.938278001447962e-17, 5.938278001447962e-17, 1.0, 6.2144218264730725e-19, 2.566785074529357e-18]
flag : 0.0
Token 1
type_id : 3.0
pose_in_ref xyz : [-2.2069588823114827e-08, 7.90906717895723e-09, 2.1668267180530165e-08]
pose_in_ref rot6d : [1.0, 3.951016847025812e-09, -3.951016847025812e-09, 1.0, 1.3024523681792743e-08, -8.553106667363863e-09]
hand_in_this xyz : [0.16784973442554474, -0.028815625235438347, -0.0773509219288826]
hand_in_this rot6d : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512916341423988]
flag : -1.0
Token 2
type_id : 4.0
pose_in_ref xyz : [-0.24469204246997833, -0.06749166548252106, 0.01143864169716835]
pose_in_ref rot6d : [0.962512731552124, 0.03149929270148277, 0.2607153654098511, 0.1665089875459671, -0.07481195032596588, 0.9855366945266724]
hand_in_this xyz : [0.4138026237487793, -0.0680706650018692, 0.060859549790620804]
hand_in_this rot6d : [-0.00673748878762126, -0.9991859197616577, 0.18744736909866333, -0.040333494544029236, 0.9822515845298767, 0.0008433718467131257]
flag : -1.0
One thing to note here: x_ict is not a file name. It is the name of a PyTorch tensor passed into the model.
Predicting a Future 50-Step Trajectory
During training, HumanEgo learns to predict a short future 6DoF hand motion from a first-person image and ICT.
In the default configuration, pred_horizon is 50.
This means the model predicts 50 future steps. If the Aria RGB stream is 30 FPS, this roughly corresponds to:
50 frames / 30 fps ≈ 1.67 seconds
However, the actual time span can change if you modify the stride or the control frequency during real robot inference.
The model does not predict the detailed positions of the two virtual gripper fingers.
Instead, for each future step, it predicts:
end-effector position
end-effector orientation
grasp value
In the single-hand setting, each step has 10 dimensions:
[x, y, z, r1, r2, r3, r4, r5, r6, grasp]
So a 50-step trajectory becomes:
50 × 10
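The 50 × 10 layout can be unpacked as in the sketch below. It uses a synthetic trajectory rather than model output, and it assumes the six rotation values follow the same ordering as the rot6d values in the ICT example output (first two rotation columns, flattened row by row); split_action and rot6d_to_matrix are hypothetical helpers, not HumanEgo functions.

```python
import numpy as np


def rot6d_to_matrix(r6):
    """Rebuild a 3x3 rotation from its first two columns via Gram-Schmidt."""
    cols = np.asarray(r6, dtype=float).reshape(3, 2)
    a, b = cols[:, 0], cols[:, 1]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x          # remove the component along x
    y = b / np.linalg.norm(b)
    z = np.cross(x, y)                # third axis completes a right-handed frame
    return np.stack([x, y, z], axis=1)


def split_action(y_action):
    """Split an (H, 10) trajectory into positions, rotations, and grasp values."""
    pos = y_action[:, 0:3]
    rot = np.stack([rot6d_to_matrix(r) for r in y_action[:, 3:9]])
    grasp = y_action[:, 9]
    return pos, rot, grasp


pred_horizon, fps = 50, 30.0
y_action = np.zeros((pred_horizon, 10))
y_action[:, 0] = np.linspace(0.0, 0.1, pred_horizon)  # move 10 cm along x
y_action[:, 3:9] = [1, 0, 0, 1, 0, 0]                 # identity orientation
pos, rot, grasp = split_action(y_action)

print(pos.shape, rot.shape, grasp.shape)
print(f"duration at {fps:.0f} FPS: {pred_horizon / fps:.2f} s")
```

Rebuilding the rotation with Gram-Schmidt also means that slightly non-orthogonal network outputs still map to a valid rotation matrix.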
Training
Training can be started with:
python -m training.FlowMatchingTrainer \
--task serve_bread \
--use_cfg \
--job HumanEgo \
2>&1 | tee blog_logs/train_400.log
In the default configuration:
epochs = 400
batch_size = 32
pred_horizon = 50
At a high level, training does this:
current image
+
current ICT
↓
learn to generate the future 50-step end-effector trajectory
It is important to clarify what is being trained here.
This training step does not retrain SAM2, Grounding DINO, LaMa, or CoTracker.
Those models are used during preprocessing.
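The model trained by HumanEgo is the policy itself. Training draws on:
processed images
hand and object poses stored in training_data.json
ICT built from those poses
future hand trajectories, used as the prediction target
From these, the policy learns to generate a future end-effector trajectory from the current observation.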
In other words, HumanEgo trains a trajectory generation policy. This policy tells the robot where the end-effector should move next.
HumanEgo uses Flow Matching as the learning method for this trajectory generation policy.
What Is Flow Matching?
Flow Matching is a general generative modeling method.
It learns a vector field that moves samples from a noise distribution toward the data distribution.
In HumanEgo, this idea is applied to future robot trajectory generation.
During training, we first have the ground-truth future trajectory from the human demonstration:
x1 = ground-truth future 50-step trajectory
Then we sample random noise with the same shape:
x0 = random trajectory
Next, we create an intermediate trajectory:
xt = (1 - t) x0 + t x1
The model is conditioned on the current image and ICT, and learns:
Given this intermediate trajectory xt,
which direction should it move to become closer to the ground-truth trajectory x1?
Intuitively, the model repeatedly practices the following task:
random future trajectory
↓
predict how to move it toward a human-demonstration-like trajectory
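The interpolation and the "which direction" target can be written down in a few lines of NumPy. This sketch follows the common linear-path formulation of Flow Matching, where the regression target is the constant velocity x1 - x0; whether HumanEgo uses exactly this loss, time sampling, or sampler is an assumption. The trained network is replaced here by an oracle that already knows the answer, so the code only demonstrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(50, 10))  # stand-in for a ground-truth 50-step trajectory
x0 = rng.normal(size=(50, 10))  # random noise with the same shape


def interpolate(x0, x1, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Time derivative of the straight path: the regression target."""
    return x1 - x0


def euler_sample(velocity_fn, x_start, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Training signal for one sample: squared error between a predicted velocity
# and the target. A zero prediction is used as a placeholder for the network.
t = 0.3
xt = interpolate(x0, x1, t)
loss = float(np.mean((np.zeros_like(xt) - target_velocity(x0, x1)) ** 2))

# Generation with an oracle velocity field in place of a trained network.
generated = euler_sample(lambda x, t: target_velocity(x0, x1), x0)
print("loss with a zero prediction:", loss)
print("max error after sampling:", float(np.abs(generated - x1).max()))
```

In a real policy, velocity_fn would be a network that also receives the current image and ICT, and the generated array would be the predicted future trajectory.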
Treating action sequences as generative modeling targets has become increasingly common in robot learning.
The unique part of HumanEgo is not just Flow Matching itself.
The key idea is how HumanEgo converts human first-person videos into:
arm-removed images
virtual gripper images
ICT
future end-effector trajectories
and then uses them for robot policy learning.
What Is Produced After Training?
After training, we do not immediately run the policy on a real robot.
The first thing to inspect is the offline evaluation.
Offline evaluation uses a held-out recording. In the standard public sample setup, recording 000 is held out for evaluation, while the remaining recordings are used for training.
Conceptually, offline evaluation works like this:
current frame from evaluation data
├── processed image
└── ICT
↓
trained Flow Matching policy
↓
predicted future 50-step end-effector trajectory
↓
compare with the ground-truth future trajectory
The important point is that this is not a real robot success rate.
In offline evaluation, the image, ICT, and ground-truth trajectory are already available in the evaluation dataset.
This does not include many real-world robot problems, such as:
the robot pushing the object accidentally
the gripper failing to close correctly
camera-robot calibration error
object detection failure in a real environment
robot control latency
contact and slipping
So offline evaluation only measures how well the learned policy predicts trajectories on held-out data.
Understanding the Output Files
After training, results are saved under:
runs/serve_bread/HumanEgo/
The run directory contains files such as:
latest.pt
dataset_stats.json
config.json
train_curve.png
eval_curve.png
eval_snapshots/eval_ep_*.json
eval_render/epoch_*/
If we just look at this as a file list, it can be hard to understand what each file means.
A more useful way is to group them by role.
A. Trained policy weights
latest.pt
B. Information needed to reproduce or use the policy
config.json
dataset_stats.json
C. Information for checking training progress
train_curve.png
D. Numerical results from offline evaluation
eval_curve.png
eval_snapshots/eval_ep_*.json
E. Visual results from offline evaluation
eval_render/epoch_*/
The relationship between these files is:
training_data.json
↓
training
↓
latest.pt is created
↓
apply latest.pt to evaluation data
↓
compare predicted trajectories with ground-truth trajectories
↓
eval_curve.png / eval_snapshots / eval_render are created
So eval_curve.png and eval_render/ are not just extra files.
They are the outputs that help us check how well latest.pt predicts future trajectories on evaluation data.
Inference
The trained latest.pt can be used as policy weights for inference.
However, there are two different meanings of inference here.
The first one is offline inference.
existing evaluation data
↓
generate future trajectories
↓
compare them with ground truth
The second one is real robot inference.
real camera
+
real robot
+
perception
+
calibration
+
controller
↓
generate future end-effector trajectory
↓
move the robot
This article focuses on the first one: offline inference.
The offline inference flow is:
evaluation training_data.json
evaluation image
↓
dataloader generates ICT
↓
policy loaded from latest.pt
↓
generate future 50-step end-effector trajectory
↓
compare with ground truth
The results appear as:
eval_curve.png
eval_snapshots/
eval_render/
How to Read Offline Evaluation Results
When checking offline evaluation, I usually look at the outputs in this order:
1. train_curve.png
Did training itself proceed normally?
2. eval_curve.png
Did the evaluation error decrease?
3. eval_snapshots/eval_ep_*.json
What are the final quantitative metrics?
4. eval_render/epoch_*/
Do the predicted trajectories look natural?
(Figure: train_curve.png and eval_curve.png)
For example, the final evaluation video may be located at:
runs/serve_bread/HumanEgo/eval_render/epoch_0400/evaluation_vis.mp4
https://youtube.com/shorts/cVYoChnYpHg?feature=share
The main metrics from:
runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json
look like this:
num snapshots: 400
latest: runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json
epoch: 400
frames: 864
done_acc: 0.9085648148148148
pos_err_k1_m: 0.03651409354750757
pos_err_kK_m: 0.037472219930754766
rot_err_k1_deg: 5.2715283058307785
rot_err_kK_deg: 7.541413042280409
pos_err_w_m: 0.037069905003556954
rot_err_w_deg: 6.384099839530582
grasp_f1_k1: 0.8659217877094972
grasp_f1_kK: 0.8555478018143754
grasp_f1_w: 0.8580602059355177
zero_state_ratio: 0.0
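The metric names suggest position errors in meters and rotation errors in degrees. The sketch below shows one generic way to compute such errors between a predicted and a ground-truth trajectory. It is not HumanEgo's evaluation code, and the exact definitions behind the k1, kK, and w suffixes are not covered here; position_error_m, rotation_error_deg, and rot_z are hypothetical helpers run on synthetic data.

```python
import numpy as np


def position_error_m(p_pred, p_gt):
    """Mean Euclidean distance between predicted and true positions."""
    return float(np.linalg.norm(p_pred - p_gt, axis=-1).mean())


def rotation_error_deg(R_pred, R_gt):
    """Mean geodesic angle (degrees) between predicted and true rotations."""
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    cos = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())


def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    a = np.radians(deg)
    return np.array(
        [[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]]
    )


# Synthetic 50-step example: constant 5 cm offset and constant 5 degree twist.
p_gt = np.zeros((50, 3))
p_pred = p_gt + np.array([0.03, 0.0, 0.04])
R_gt = np.stack([np.eye(3)] * 50)
R_pred = np.stack([rot_z(5.0)] * 50)

pos_err = position_error_m(p_pred, p_gt)
rot_err = rotation_error_deg(R_pred, R_gt)
print(f"pos_err: {pos_err:.4f} m, rot_err: {rot_err:.2f} deg")
```

Read this way, the reported position errors of roughly 0.037 would correspond to a few centimeters, and the rotation errors to a few degrees.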
Again, we need to be careful here.
Offline evaluation is not the same as real robot success rate.
Offline evaluation checks:
current image + current ICT
↓
predicted future trajectory
↓
comparison with ground-truth future trajectory
But in a real robot setup, many additional factors appear:
object pose estimation
camera-robot calibration
robot control
contact dynamics
slipping
grasp failure
latency
So even if offline evaluation looks good, it does not guarantee real-world robot success.
However, if offline evaluation is bad, it is a strong signal that we should check the training data, preprocessing results, ICT, and configuration before moving to a real robot.
What I Learned from This Experiment
After running HumanEgo on the serve_bread sample, the overall structure became much clearer to me.
Here is the big picture:
preprocess
Converts human videos into a format suitable for robot learning
training_data.json
Stores hand poses, object poses, camera poses, grasp states, and image paths
ICT
Numerical tokens generated from training_data.json during training and inference
training
Learns a policy that generates future end-effector trajectories from current images and ICT
latest.pt
Trained policy weights
inference
Generates future end-effector trajectories from clean images and ICT,
then passes them to a robot controller
The most important takeaway is that HumanEgo is not simply “training on human videos.”
It first transforms human egocentric demonstrations into a robot-friendly representation:
remove human appearance
represent interaction in 3D
encode hands and objects as ICT
learn future end-effector trajectories
This makes it easier to understand why preprocessing is so central to the method.
For me, the serve_bread example was useful because it showed the entire pipeline end to end:
raw human video
↓
preprocessing outputs
↓
training_data.json
↓
ICT
↓
Flow Matching policy training
↓
offline trajectory evaluation
Before trying real robot execution, I think this offline pipeline is the right place to start.
It helps answer a very practical question:
Can the model predict reasonable future trajectories from the processed human demonstration data?
Once that part looks reliable, the next challenge is connecting the learned policy to real robot perception, calibration, and control.





