Understanding EgoMimic by Running It

1. What is EgoMimic?

EgoMimic is a method for using first-person human videos and 3D hand trajectories as data for robot imitation learning.

https://egomimic.github.io/

In many previous approaches that use human videos, the human video is often used only to extract high-level intent: where the robot should move next, which object it should touch, or what subgoal it should aim for. That high-level information is then passed to a separate robot policy.

EgoMimic takes a different approach. It treats human data not just as auxiliary information, but as something closer to an embodied demonstration, similar in spirit to robot demonstrations.

Human:
  Aria image + 3D hand trajectory
        ↓
  Converted into a format useful for robot learning
        ↓
  Trained together with robot data

Robot:
  Aria image + wrist image + joint action
        ↓
  Learn an executable robot policy

EgoMimic is a full-stack framework that combines several pieces:

human data collection with Meta's Project Aria,
Aria and wrist cameras on the robot side,
cross-domain alignment between humans and robots,
and a policy architecture for co-training on human and robot demonstrations.

The paper describes four major components: Project Aria, a low-cost bimanual robot setup, cross-domain alignment, and a human/robot co-training architecture.

Credit: https://egomimic.github.io/

Sensor setup: Human Aria, Robot Aria, and wrist cameras

EgoMimic uses Project Aria on both the human side and the robot side to make their viewpoints as similar as possible.

On the human side, the operator wears Project Aria. Aria's forward-facing RGB camera is used as the first-person image. The side SLAM cameras are used for device pose estimation and hand tracking.

The robot also carries Project Aria. This is one of the interesting design choices in EgoMimic. The robot's “head” image is not captured by a custom robot-only camera. Instead, the image comes from a Project Aria device mounted on the upper body of the robot. This reduces the camera-domain gap between human Aria images and robot images. In addition, the robot has Intel RealSense D405 wrist cameras so that the policy can see the manipulation area in more detail.

Data source	Sensor	Role
Human Aria	Project Aria worn by a human	First-person RGB, SLAM, and hand tracking
Robot Aria	Project Aria mounted on the upper body of the robot	Robot first-person view close to the human viewpoint
Robot Wrist Camera	RealSense D405 on the robot wrist	Local view for fine manipulation

How human data and robot data differ

EgoMimic trains on human and robot data together. However, the two data types are not identical.

Human data contains first-person images, 3D hand poses, and future hand trajectories. Robot data contains first-person images, wrist-camera images, end-effector poses, joint positions, and joint actions.

Item	Human data	Robot data
Images	Aria egocentric RGB	Robot Aria RGB + wrist views
State	3D hand pose	End-effector pose + joint positions
Action	Future hand pose trajectory	Future end-effector pose + joint action
Gripper action	Not directly available	Included in joint action
Role in training	Teaches how the hand pose moves	Teaches executable robot joint actions

The key point is that human data does not contain robot joint actions.

From human Aria data, we can obtain the position and trajectory of the human hand. But from human data alone, we cannot directly learn how to move the robot's shoulder, elbow, wrist, or gripper.

EgoMimic handles this by using hand/end-effector pose prediction as a shared bridge between humans and robots. The robot-executable joint actions are still learned from robot demonstrations.

The three gaps EgoMimic tries to close

Simply mixing human data and robot data is not enough. EgoMimic mainly addresses three kinds of human-robot gaps:

1. Coordinate-frame gap
2. Action-distribution gap
3. Visual-appearance gap

The paper also describes the human-robot gap in terms of action coordinate frames, action distributions, and visual appearance.

1. Coordinate-frame gap

A human wears Aria on the head, so the camera moves during the task. When the person reaches out, looks around, or changes posture, the camera coordinate frame changes over time.

If we use future hand positions as they are, each future point may belong to a different camera coordinate frame. That is an awkward representation for the model.

EgoMimic uses the Aria pose estimated by SLAM to transform the future hand trajectory into the current camera coordinate frame. This allows the model to predict future hand motion relative to the image it is seeing now.

2. Action-distribution gap

The movement distribution of a human hand and that of a robot end-effector are different, even for the same task.

For example, a human arm has flexible shoulder and elbow motion. A robot arm, on the other hand, is constrained by its mounting position and joint limits. As a result, even when both are reaching for the same object, the center and scale of their end-effector coordinates can be different.

If we mix these distributions directly, the model may simply learn separate patterns such as “this is the human distribution” and “this is the robot distribution.”

To reduce this issue, EgoMimic applies Z-score normalization separately for the human and robot domains.

x_norm = (x - mean) / std

In other words, for each data source, EgoMimic subtracts the mean and divides by the standard deviation.

3. Visual-appearance gap

Human hands and robot arms look very different. Human images may contain skin and sleeves, while robot images show metal or plastic arms.

If the model is trained directly on raw images, it may separate human and robot data based on appearance rather than focusing on task-relevant motion.

EgoMimic addresses this by using SAM/SAM2 to mask the human hand or robot arm, and then overlaying a red line.

The mask reduces appearance differences such as skin, sleeves, and robot hardware. The red line keeps useful directional information about the hand or arm.

EgoMimic is ACT-based action chunking, not Flow Matching

EgoMimic uses ACT, or Action Chunking Transformer, as its policy backbone.

Given the current image and state, the model predicts a chunk of future actions. In this setup, the chunk contains 100 future points. For robot data, EgoMimic learns future end-effector pose actions and joint actions. For human data, it learns future hand pose actions.

2. What this post tries to check

In this post, I am not trying to fully reproduce the paper results. Instead, I want to check the core mechanics of EgoMimic on my own machine.

Specifically, I will look at the following:

What I try	Purpose
Open the public HDF5 data	Check how human and robot data are stored
Extract images	Inspect Aria images, wrist images, masked images, and line images
Visualize action distributions	Understand the human/robot distribution gap and normalization
Debug-run EgoMimic itself	Check the training loop that reads human and robot HDF5 files together
Look at validation videos	Understand how the offline visualization should be interpreted

3. Public HDF5 data

EgoMimic's public data is stored in HDF5 format. HDF5 is a data format that can store a folder-like hierarchy inside a single file. It can contain images, joint states, actions, train/validation splits, and other arrays in one .hdf5 file.

HDF5 mainly consists of three concepts:

Term	Meaning
Group	Similar to a folder
Dataset	The actual data, similar to a NumPy array
Attribute	Metadata attached to a group or dataset

At a high level, EgoMimic's HDF5 structure looks like this:

bowlplace_robot.hdf5
├── data
│   ├── demo_0
│   │   ├── obs
│   │   │   ├── front_img_1
│   │   │   ├── front_img_1_line
│   │   │   ├── front_img_1_masked
│   │   │   ├── right_wrist_img
│   │   │   ├── joint_positions
│   │   │   └── ee_pose
│   │   ├── actions_joints_act
│   │   └── actions_xyz_act
│   └── ...
└── mask
    ├── train
    ├── valid
    ├── demo_train
    └── demo_valid

Under data, each demo stores images, states, and actions. Under mask, the file stores lists of which demos should be used for training and which demos should be used for validation.

Inspecting the HDF5 contents

The following script prints the HDF5 structure, dataset shapes, and dtypes.

#!/usr/bin/env python3

import argparse
from pathlib import Path

import h5py


def print_hdf5_tree(path: Path, max_items: int = 120):
    """Print dataset paths, shapes, and dtypes in an HDF5 file."""
    count = 0

    with h5py.File(path, "r") as f:
        print("Top-level keys:", list(f.keys()))

        def visitor(name, obj):
            nonlocal count
            if count >= max_items:
                return
            if isinstance(obj, h5py.Dataset):
                print(f"{name:60s} shape={obj.shape} dtype={obj.dtype}")
                count += 1

        f.visititems(visitor)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True, type=Path)
    parser.add_argument("--max-items", type=int, default=120)
    args = parser.parse_args()

    print_hdf5_tree(args.dataset, args.max_items)


if __name__ == "__main__":
    main()

Example commands:

python tools/print_hdf5_tree.py \
  --dataset datasets/bowlplace_robot.hdf5 \
  --max-items 80

python tools/print_hdf5_tree.py \
  --dataset datasets/bowlplace_human.hdf5 \
  --max-items 80

In my Bowlplace check, the robot HDF5 was 41.29 GB. It had four top-level demo groups, and each demo had 3,000 samples. That gives 12,000 timesteps in total, which is about 4 minutes at 50 Hz.

The human HDF5 was 37.85 GB. It had 50 top-level demo groups and 14,685 timesteps in total, which is about 8.16 minutes at 30 Hz.

File	Size	Top-level demo groups	Timesteps	Approx. duration
`bowlplace_robot.hdf5`	41.29 GB	4	12,000	About 4.00 min
`bowlplace_human.hdf5`	37.85 GB	50	14,685	About 8.16 min

Extracting images from HDF5

Next, I extracted actual images from the HDF5 files.

For the robot data, the following image keys are useful to inspect:

obs/front_img_1
obs/front_img_1_masked
obs/front_img_1_line
obs/right_wrist_img

For the human data, I used these keys:

obs/front_img_1
obs/front_img_1_masked
obs/front_img_1_line

Here is a small script that saves one image from a specified demo and frame index.

#!/usr/bin/env python3

import argparse
from pathlib import Path

import h5py
from PIL import Image


def save_image(hdf5_path: Path, demo: str, key: str, index: int, output: Path):
    """Save one image frame from an HDF5 dataset."""
    with h5py.File(hdf5_path, "r") as f:
        dataset_path = f"data/{demo}/{key}"
        if dataset_path not in f:
            raise KeyError(f"Dataset not found: {dataset_path}")

        image = f[dataset_path][index]

    output.parent.mkdir(parents=True, exist_ok=True)
    Image.fromarray(image).save(output)
    print(f"Saved: {output}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True, type=Path)
    parser.add_argument("--demo", required=True)
    parser.add_argument("--key", required=True)
    parser.add_argument("--index", type=int, default=0)
    parser.add_argument("--output", required=True, type=Path)
    args = parser.parse_args()

    save_image(args.dataset, args.demo, args.key, args.index, args.output)


if __name__ == "__main__":
    main()

Example commands:

python tools/save_hdf5_image.py \
  --dataset datasets/bowlplace_robot.hdf5 \
  --demo demo_0 \
  --key obs/front_img_1 \
  --index 0 \
  --output blog_outputs/robot_front_raw.png

python tools/save_hdf5_image.py \
  --dataset datasets/bowlplace_robot.hdf5 \
  --demo demo_0 \
  --key obs/front_img_1_line \
  --index 0 \
  --output blog_outputs/robot_front_line.png

python tools/save_hdf5_image.py \
  --dataset datasets/bowlplace_robot.hdf5 \
  --demo demo_0 \
  --key obs/right_wrist_img \
  --index 0 \
  --output blog_outputs/robot_right_wrist.png

python tools/save_hdf5_image.py \
  --dataset datasets/bowlplace_human.hdf5 \
  --demo demo_0 \
  --key obs/front_img_1_line \
  --index 0 \
  --output blog_outputs/human_front_line.png

4. Visualizing action distributions and normalization

After checking the images, I looked at the actions.

EgoMimic does not only predict the next single action step. Instead, it handles multiple future action steps together. This is called an action chunk.

For example, in my HDF5 file, the robot-side actions_joints_act had this shape:

actions_joints_act: (3000, 100, 7)

This means:

For each of 3,000 timesteps,
the dataset stores 100 future points,
and each future point has a 7-dimensional joint action.

Similarly, the pose action actions_xyz_act had this shape:

actions_xyz_act: (3000, 100, 3)

This represents 100 future points of 3D pose action. In my HDF5 inspection, the robot-side actions_joints_act was (3000, 100, 7), and actions_xyz_act was (3000, 100, 3). The human-side actions_xyz_act was also stored in shapes such as (300, 100, 3) for each demo.

Plotting the action distribution

Here, I extracted actions_xyz_act from both the human and robot HDF5 files, and compared dim=1. In many coordinate conventions, dim=1 is the y direction, so it is useful to think of it as the left-right component.

#!/usr/bin/env python3

import argparse
from pathlib import Path

import h5py
import numpy as np
import matplotlib.pyplot as plt


def collect_action_values(hdf5_path: Path, key: str, max_rows: int):
    """Collect action values from all demos in an HDF5 file."""
    values = []

    with h5py.File(hdf5_path, "r") as f:
        for demo in f["data"].keys():
            dataset_path = f"data/{demo}/{key}"
            if dataset_path not in f:
                continue

            arr = f[dataset_path][()]
            arr = arr.reshape(-1, arr.shape[-1])
            values.append(arr)

    if not values:
        raise RuntimeError(f"No action data found for key: {key}")

    out = np.concatenate(values, axis=0)
    if len(out) > max_rows:
        out = out[:max_rows]
    return out


def zscore(x):
    """Apply per-dataset Z-score normalization."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + 1e-6)


def plot_distribution(human, robot, dim: int, title: str, output: Path):
    """Plot human and robot action distributions."""
    plt.figure()
    plt.hist(human[:, dim], bins=80, alpha=0.5, density=True, label="human")
    plt.hist(robot[:, dim], bins=80, alpha=0.5, density=True, label="robot")
    plt.xlabel(f"actions_xyz_act dim={dim}")
    plt.ylabel("density")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    output.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output, dpi=160)
    plt.close()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--human", required=True, type=Path)
    parser.add_argument("--robot", required=True, type=Path)
    parser.add_argument("--dim", type=int, default=1)
    parser.add_argument("--out-dir", required=True, type=Path)
    parser.add_argument("--max-rows", type=int, default=50000)
    args = parser.parse_args()

    human = collect_action_values(args.human, "actions_xyz_act", args.max_rows)
    robot = collect_action_values(args.robot, "actions_xyz_act", args.max_rows)

    plot_distribution(
        human,
        robot,
        args.dim,
        "Before normalization",
        args.out_dir / "action_distribution_before.png",
    )

    plot_distribution(
        zscore(human),
        zscore(robot),
        args.dim,
        "After per-domain Z-score normalization",
        args.out_dir / "action_distribution_after.png",
    )


if __name__ == "__main__":
    main()

Run it like this:

python tools/plot_action_distribution.py \
  --human datasets/bowlplace_human.hdf5 \
  --robot datasets/bowlplace_robot.hdf5 \
  --dim 1 \
  --out-dir blog_outputs/action_distribution

5. Debug-running EgoMimic itself

Next, I ran a small debug training job with the EgoMimic code.

The EgoMimic training code reads the robot HDF5 and the human HDF5 at the same time.

On the robot side, it learns future joint actions and future end-effector pose actions. On the human side, it learns future hand pose actions.

The training flow looks roughly like this:

Human batch:
  front_img_1_line + ee_pose
        ↓
  shared encoder / transformer
        ↓
  future hand pose chunk
        ↓
  pose loss

Robot batch:
  front_img_1_line + right_wrist_img + joint_positions
        ↓
  shared encoder / transformer
        ↓
  future EEF pose chunk + future joint action chunk
        ↓
  pose loss + joint loss

Checking the train/validation split

In my robot HDF5 file, the original mask/train and mask/valid pointed to the same four demos. Using them as-is caused a train/validation overlap error. For debugging, I created new demo_train and demo_valid splits.

For the robot data, I used this split:

mask/demo_train: demo_0, demo_1, demo_10
mask/demo_valid: demo_11

I also created demo_train and demo_valid with the same names on the human side. EgoMimic reads both the robot and human HDF5 files, so both files need train/validation splits.

In my inspection logs, I confirmed that the robot side had three demo_train entries and one demo_valid entry, while the human side had 49 demo_train entries and one demo_valid entry.

Running EgoMimic

The command I used was:

cd EgoMimic/egomimic

python scripts/pl_train.py \
  --config configs/egomimic_oboo_demo_split.json \
  --dataset ../datasets/bowlplace_robot.hdf5 \
  --dataset_2 ../datasets/bowlplace_human.hdf5 \
  --debug \
  --no-wandb \
  --output_dir ../trained_models_blog \
  --description egomimic_demo_split_debug

The important point is that I used a config that refers to demo_train and demo_valid, instead of the original configs/egomimic_oboo.json.

The --debug flag is not meant for getting good performance. It is useful for quickly checking whether the HDF5 files can be loaded, whether the model can be initialized, whether forward/backward passes work, and whether validation videos are generated.

How to read the validation video: green, purple, and red

During training, EgoMimic outputs validation videos. These videos are not real robot rollouts. They are offline visualizations created from validation data in the HDF5 file. The video overlays the model's predicted actions and the ground-truth actions.

A useful way to interpret the colors is:

Color	Meaning
Green	Ground-truth future action chunk stored in the HDF5 file
Purple	Future joint action chunk predicted by the joint-action head
Red	Future pose action chunk predicted by the pose head
Red line on the input image	The mask/line processing already included in `front_img_1_line`

The green trajectory is not produced by the model. It is the ground truth. So if the green line looks like a trajectory, that means the HDF5 file contains a future ground-truth trajectory.

Purple and red are model predictions. In my debug run, purple and red looked almost like dots, with only small jittering motion. I interpret this as the predicted 100 future points collapsing into almost the same location.

So the validation video can be read like this:

Green:
  Ground-truth future trajectory.
  It appears as a line because the HDF5 file contains the future trajectory.

Purple:
  Joint action prediction.
  In this debug run, the model has not learned enough yet, so the prediction collapses into a dot-like shape.

Red:
  Pose action prediction.
  Similarly, the 100 future points do not spread out much and appear almost dot-like.

This does not mean the visualization is broken. In fact, this is a natural result for a debug run. I used the sample dataset and trained only briefly with --debug, so this setup is not expected to reproduce the performance reported in the paper.

6. What I learned, and what I did not verify

Here is what I was able to confirm in this hands-on check:

What I confirmed
EgoMimic reads human HDF5 and robot HDF5 files at the same time
The HDF5 structure contains images, poses, and joint actions
`front_img_1_line` is already processed with mask + red line
`actions_xyz_act` and `actions_joints_act` are 100-step future action chunks
Per-domain normalization helps align human and robot action distributions
EgoMimic's debug training run works end to end
The validation video can be interpreted as an offline overlay of ground truth and predictions

At the same time, this experiment was not a performance evaluation. I did not reproduce the paper's reported scores, and I did not verify whether the trained model succeeds on a real robot.

The paper reports that, on the Object-in-Bowl dataset, EgoMimic trained with 2 hours of robot data plus 1 hour of hand data outperformed ACT trained with 3 hours of robot data.

This result captures the main idea of EgoMimic well: instead of only collecting more robot data, adding first-person human hand-trajectory data may improve robot learning more efficiently.

7. Conclusion

EgoMimic is a method for using first-person human videos and 3D hand trajectories in robot imitation learning.

This debug experiment does not reproduce the paper's full performance. Still, by looking step by step at the HDF5 structure, Aria images, mask/line images, action distributions, normalization, the debug training run, and the validation video, I was able to understand what EgoMimic takes as input, what it predicts, and why human data can be mixed into robot policy learning.

For me, the most useful takeaway is that EgoMimic does not magically convert human demonstrations into robot joint actions. Instead, it creates a shared pose-prediction bridge between human hand motion and robot end-effector motion, while still using robot data to learn executable joint actions.

That design makes the method feel much more concrete: human videos provide extra embodied motion examples, and robot demonstrations teach how those motions should become robot actions.

Understanding EgoMimic by Running It: How First-Person Human Videos Can Help Robot Imitation Learning

1. What is EgoMimic?

Sensor setup: Human Aria, Robot Aria, and wrist cameras

How human data and robot data differ

The three gaps EgoMimic tries to close

1. Coordinate-frame gap

2. Action-distribution gap

3. Visual-appearance gap

EgoMimic is ACT-based action chunking, not Flow Matching

2. What this post tries to check

3. Public HDF5 data

Inspecting the HDF5 contents

Extracting images from HDF5

4. Visualizing action distributions and normalization

Plotting the action distribution

5. Debug-running EgoMimic itself

Checking the train/validation split

Running EgoMimic

How to read the validation video: green, purple, and red

6. What I learned, and what I did not verify

7. Conclusion

Comments

More from this blog

Teaching Robots from Human Egocentric Videos: A Hands-On Look at HumanEgo

I Tried TripoSplat to Generate 3D Gaussian Splatting from a Single Image — Then I Tried Animating a Plant

ERP in Manufacturing: The Operating System Behind the Business

Turning Real Objects into CAD Models

Command Palette

1. What is EgoMimic?

Sensor setup: Human Aria, Robot Aria, and wrist cameras

How human data and robot data differ

The three gaps EgoMimic tries to close

1. Coordinate-frame gap

2. Action-distribution gap

3. Visual-appearance gap

EgoMimic is ACT-based action chunking, not Flow Matching

2. What this post tries to check

3. Public HDF5 data

Inspecting the HDF5 contents

Extracting images from HDF5

4. Visualizing action distributions and normalization

Plotting the action distribution

5. Debug-running EgoMimic itself

Checking the train/validation split

Running EgoMimic

How to read the validation video: green, purple, and red

6. What I learned, and what I did not verify

7. Conclusion

Comments

More from this blog