# Teaching Robots from Human Egocentric Videos: A Hands-On Look at HumanEgo

HumanEgo is a research implementation from the University of Maryland for learning robot manipulation policies from human first-person videos.

Official project page:  
[https://humanego-ai.github.io/](https://humanego-ai.github.io/)

At first glance, using human videos for robot learning sounds very attractive. Humans already perform many useful manipulation tasks, and first-person videos are relatively easy to collect.

However, there is a major challenge.

Human hands and arms look completely different from robot arms.

For example, during training, the video may contain:

```text
human hands
human arms
five fingers
skin
```

But during real robot inference, the camera may see:

```text
robot arms
grippers
metal or plastic links
cables
```

If we train a model directly on human videos, the model may learn the appearance of human arms instead of the essence of the manipulation task.

HumanEgo tries to bridge this gap by converting human videos into a shared representation that is more suitable for robot learning.

Conceptually, the pipeline looks like this:

```text
Remove the human arm from the image
  ↓
Draw a virtual gripper at the human hand position
  ↓
Represent the 3D relationship between the hand and objects as ICT
  ↓
Train a policy that generates future end-effector trajectories
```

Throughout this article, a *clean image* means an image from which the human arm has been removed. HumanEgo can also draw a virtual gripper on top of that image to show where the hand is and how it is interacting with the object.

ICT stands for **Interaction-Centric Tokens**. These are numerical tokens that represent hands and objects as 6DoF entities.

* * *

## Goal of This Experiment

In this article, I am not trying to control a real robot right away.

Instead, I first want to run the public `serve_bread` example, a task in which a piece of bread (maybe a croissant) is served or moved.

The goal is to understand the full offline pipeline in three steps:

1.  **Preprocessing**  
    Convert human videos into training data.
    
2.  **Training**  
    Train a model that predicts future hand trajectories from images and ICT.
    
3.  **Offline evaluation**  
    Compare predicted trajectories with ground-truth trajectories on held-out data.
    

For this experiment, I use the two official `serve_bread` recordings.

I use:

```text
mps_serve_bread_000_vrs
```

as evaluation data, and:

```text
mps_serve_bread_001_vrs
```

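as training data. With only two recordings downloaded, `001` is the single training recording. Any further recordings you download would also go into the training set.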

First, clone the repository and set up the environment:

```bash
git clone https://github.com/TX-Leo/HumanEgo.git
cd HumanEgo

conda create -n humanego python=3.11 -y
conda activate humanego

PREDOWNLOAD=1 bash setup.sh
```

* * *

## Downloading the Sample Data

For this article, I want to inspect what happens inside preprocessing. So I download only the input data, without the precomputed preprocessing outputs.

```bash
python scripts/download_data.py \
  --task serve_bread \
  --num 2 \
  --input-only
```

The `--input-only` option skips the precomputed `preprocess/` outputs. This allows us to run the preprocessing pipeline ourselves.

I download two recordings because the default HumanEgo training setup holds out the first recording for evaluation and uses the remaining recordings for training.
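The hold-out rule described above is easy to state in code. The following is only a sketch of that rule, not the repository's implementation: the function name `split_recordings` is hypothetical, and sorting by directory name to decide which recording is "first" is my assumption.

```python
def split_recordings(names):
    """Hold out the first recording (by sorted name) for evaluation.

    Assumption: "first" means first in lexicographic order of the
    recording directory names. The repository may decide this differently.
    """
    ordered = sorted(names)
    if len(ordered) < 2:
        raise ValueError("need at least two recordings: one for eval, one for training")
    return ordered[0], ordered[1:]


eval_rec, train_recs = split_recordings(
    ["mps_serve_bread_001_vrs", "mps_serve_bread_000_vrs"]
)
print("eval :", eval_rec)
print("train:", train_recs)
```

With a single recording there would be nothing left to train on, which is why `--num 2` is the minimum useful download here.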

* * *

## What Does Preprocessing Do?

Now let’s preprocess the first recording.

```bash
mkdir -p blog_logs

/usr/bin/time -v python -m preprocess.Preprocess \
  --mps_path ./data/serve_bread/aria/mps_serve_bread_000_vrs \
  --task serve_bread \
  2>&1 | tee blog_logs/preprocess_000.log
```

Then run the same command for the second recording.

```bash
/usr/bin/time -v python -m preprocess.Preprocess \
  --mps_path ./data/serve_bread/aria/mps_serve_bread_001_vrs \
  --task serve_bread \
  2>&1 | tee blog_logs/preprocess_001.log
```

Preprocessing is not just about exporting RGB images.

It creates several types of data needed for training. The overall flow looks like this:

```text
Extract Aria RGB, hand, and SLAM data
  ↓
Trim the manipulation segment
  ↓
Use Grounding DINO + SAM2 to create object and arm masks
  ↓
Select keypoints on the objects
  ↓
Track 2D keypoints with CoTracker
  ↓
Lift 2D tracks into 3D using SLAM camera poses
  ↓
Remove the human arm with LaMa
  ↓
Render a virtual gripper and object keypoints
  ↓
Merge everything into training_data.json
```

This preprocessing step is where HumanEgo converts raw human egocentric video into a representation that can later be used for robot policy learning.

* * *

## Looking at the Images Generated by Preprocessing

After preprocessing, each frame has several images saved under:

```text
preprocess/all_data/<idx>/
```

Examples include:

```text
rgb.png
rgb_WoArm.png
rgb_WArmObjKpts.png
rgb_WoArm_WArmObjKpts.png
```
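Before looking at individual images, it can help to confirm that preprocessing actually wrote every expected file for every frame. Here is a small sketch that does this with `pathlib`. The helper name `find_incomplete_frames` is my own, and the file list is simply copied from the examples above, so extend it if your run produces more outputs.

```python
from pathlib import Path

# File names copied from the listing above; this is not an exhaustive list.
EXPECTED = (
    "rgb.png",
    "rgb_WoArm.png",
    "rgb_WArmObjKpts.png",
    "rgb_WoArm_WArmObjKpts.png",
)


def find_incomplete_frames(all_data_dir, expected=EXPECTED):
    """Return {frame_dir_name: [missing file names]} for frames lacking outputs."""
    incomplete = {}
    for frame_dir in sorted(Path(all_data_dir).iterdir()):
        if not frame_dir.is_dir():
            continue
        missing = [name for name in expected if not (frame_dir / name).is_file()]
        if missing:
            incomplete[frame_dir.name] = missing
    return incomplete


root = Path("./data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data")
if root.is_dir():
    report = find_incomplete_frames(root)
    print(f"{len(report)} incomplete frame(s)")
    for idx, missing in report.items():
        print(idx, "missing:", ", ".join(missing))
else:
    print("not found:", root)
```

An empty report means every frame directory contains all four images.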

Let’s look at what each of these means.

* * *

### 1\. `rgb.png`

This is the original RGB image extracted from the Aria recording. The human hand and arm are still visible.

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/b2be3ff1-2d19-4df1-9b6c-1d2ff93279f1.png align="center")

### 2\. `mask_arm.png`

This is the mask of the human hand and arm extracted by DINO/SAM2. This mask defines the region that will later be removed by LaMa.

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/1b4afc7a-e388-49c5-8bcc-5b50478d706e.png align="center")

### 3\. `mask_obj1.png` / `mask_obj2.png`

In the `serve_bread` task, the task configuration treats `obj1` and `obj2` as target objects. These object masks correspond to objects such as the bread and the plate.

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/4552fbf8-91ab-453c-b62e-2b1e32ec9820.png align="center")

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/a3d6f5e7-9304-4da3-a120-37c29a32cef8.png align="center")

One important point is that HumanEgo does not magically choose the task-relevant objects from every object in the image.

Instead, the preprocessing configuration specifies open-vocabulary detection prompts for each task.

For a custom task, you would define object prompts and hand prompts in a YAML file such as:

```text
cfg/preprocess/tasks/<your_task>.yaml
```

### 4\. `rgb_WoArm.png`

This is the image after the human arm has been inpainted by LaMa. The goal is to reduce the appearance gap between human demonstrations and robot execution.

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/d41125ab-603b-4dfb-a110-234e7655371b.png align="center")

### 5\. `rgb_WoArm_WArmObjKpts.png`

This image removes the human arm and then draws a virtual gripper and object keypoints on top. The two-finger virtual gripper is purely a visual cue: it shows where the hand is and what the gripper state looks like.

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/d7726f3b-c73c-4099-88a6-bb56cae28dfc.png align="center")

However, the ICT representation does not store the two fingertips directly.

Instead, the hand is represented as:

```text
one 6DoF end-effector frame
+
a grasp value
```

* * *

## What Does It Mean to Lift 2D Tracks into 3D?

One of the less intuitive parts of preprocessing is `camtriangulator`.

`camtriangulator` takes 2D tracks produced by CoTracker and uses SLAM camera poses to estimate 3D object poses.

Intuitively, it does something like this:

```text
A 2D point in the image
Example: a point on the bread appears at u = u1, v = v1
  ↓
Track the same point across multiple frames
  ↓
Use SLAM to know the camera pose at each frame
  ↓
Intersect multiple viewing rays
  ↓
Recover a 3D point x, y, z
```

So HumanEgo is not reconstructing the entire image in 3D.

Instead, it estimates sparse 3D keypoints on the task-relevant objects.
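To make the "intersect multiple viewing rays" step concrete, here is a minimal NumPy sketch of multi-view triangulation. It is not the `camtriangulator` code. I am assuming a simple pinhole camera with intrinsics `K` and a 4x4 camera-to-world pose `c2w`, and the function names are hypothetical. Each tracked pixel is turned into a world-space ray, and the 3D point is the least-squares point closest to all rays.

```python
import numpy as np


def pixel_to_world_ray(uv, K, c2w):
    """Back-project pixel (u, v) to a ray (origin, unit direction) in world coordinates.

    Assumes an undistorted pinhole model, which is a simplification.
    """
    u, v = uv
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = c2w[:3, :3] @ d_cam
    return c2w[:3, 3].copy(), d_world / np.linalg.norm(d_world)


def intersect_rays(origins, directions):
    """Least-squares 3D point that minimizes the squared distance to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane orthogonal to d
        A += P
        b += P @ o
    return np.linalg.solve(A, b)


# Synthetic check: project a known point into three shifted cameras, then recover it.
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
point = np.array([0.1, -0.05, 1.0])
origins, directions = [], []
for shift in (-0.2, 0.0, 0.2):
    c2w = np.eye(4)
    c2w[0, 3] = shift  # camera moved sideways along x
    p_cam = (np.linalg.inv(c2w) @ np.append(point, 1.0))[:3]
    uvw = K @ p_cam
    o, d = pixel_to_world_ray(uvw[:2] / uvw[2], K, c2w)
    origins.append(o)
    directions.append(d)

recovered = intersect_rays(origins, directions)
print("recovered:", recovered)
```

With noisy real tracks the rays no longer meet in a single point, which is why a least-squares formulation is a natural fit. The camera also has to move between frames; with parallel rays the system becomes singular.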

You can inspect the related output files like this:

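```bash
# Recording directory used throughout this article
MPS=./data/serve_bread/aria/mps_serve_bread_000_vrs

ls -lh \
  "$MPS/preprocess/cotracker_results.json" \
  "$MPS/preprocess/camtriangulator_results.json" \
  "$MPS/preprocess/camtriangulator_vis.ply" \
  "$MPS/preprocess/camtriangulator_vis.png"
```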

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/9b92640f-9d19-4b4e-b969-575460d23f34.png align="center")

Why is this 3D step necessary? Because pixel coordinates alone do not tell a robot where to move in space.

An image gives us:

```text
u, v
```

But a robot needs targets like:

```text
x, y, z
end-effector orientation
gripper open / close state
```

That is why HumanEgo builds a 3D relationship between the hand and objects, and later passes that relationship to the model as ICT.

* * *

## What Is `training_data.json`?

One of the most important preprocessing outputs is:

```text
training_data.json
```

This file is the source material for ICT.

Conceptually:

```text
training_data.json
  hand 4x4 pose
  object 4x4 poses
  grasp state
  reference coordinate frame
  image paths
        ↓
training dataloader
        ↓
x_ict tensor
```

Let’s inspect one example.

```bash
python - <<'PY'
import json
from pathlib import Path

mps = Path("./data/serve_bread/aria/mps_serve_bread_000_vrs")
paths = sorted(mps.glob("preprocess/all_data/*/training_data.json"))

print("num training_data:", len(paths))
path = paths[0]
print("sample:", path)

with path.open() as f:
    d = json.load(f)

print("metadata keys:", d["metadata"].keys())
print("obs keys:", d["obs"].keys())

print("\nhands:")
for side, h in d["entities"]["hands"].items():
    print(side, "grasp=", h["grasp"])
    print("T_hand_to_world first row:", h["T_hand_to_world"][0])

print("\nobjects:")
for k, obj in d["entities"]["objects"].items():
    print(k)
    print("T_obj_to_world first row:", obj["T_obj_to_world"][0])
PY
```

Example output:

```text
num training_data: 864
sample: data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
metadata keys: dict_keys(['idx', 'ts', 'w', 'h', 'fps', 'k', 'c2w', 'anchor_key', 'is_finished', 'world_transforms'])
obs keys: dict_keys(['mask_arm_path', 'mask_obj_path', 'rgb_path', 'rgb_WArmObjKpts_path', 'rgb_WoArm_path', 'rgb_WoArm_WArmObjKpts_path'])

hands:
right grasp= 0.0
T_hand_to_world first row: [-0.9652508471493321, 0.2561715273038293, 0.051621888729812965, -0.7862446231727486]

objects:
obj1
T_obj_to_world first row: [-0.498511026341544, 0.7874135934764195, -0.3625832172394254, -0.707926009796382]
obj2
T_obj_to_world first row: [-0.24740682137970874, -0.24193038536352468, -0.9382214841767175, -0.6432356533752537]
```

This tells us that each frame contains information about:

```text
image paths
camera metadata
hand pose
object poses
grasp state
```

The dataloader later turns these values into tensors used by the model.
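One transformation that is useful to keep in mind here is expressing the hand pose relative to an object. Given the two 4x4 matrices stored in the JSON (`T_hand_to_world` and `T_obj_to_world`), this is a single matrix product. The sketch below uses made-up poses and hypothetical function names. I have not checked how the dataloader implements this step internally.

```python
import numpy as np


def invert_rigid(T):
    """Invert a 4x4 rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def hand_in_object(T_hand_to_world, T_obj_to_world):
    """Pose of the hand expressed in the object's coordinate frame."""
    return invert_rigid(T_obj_to_world) @ T_hand_to_world


# Made-up poses for illustration (not values from the dataset).
T_obj = np.eye(4)
T_obj[:3, 3] = [0.5, 0.0, 0.2]
T_hand = np.eye(4)
T_hand[:3, 3] = [0.6, 0.1, 0.2]

T_rel = hand_in_object(T_hand, T_obj)
print("hand position in object frame:", T_rel[:3, 3])
```

The relative pose does not change if both the hand and the object are moved together in the world, which is the kind of invariance a hand-object relationship needs.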

* * *

## What Is ICT?

ICT stands for **Interaction-Centric Tokens**.

ICT is a numerical representation separate from the image. It converts each hand and object into a vector token.

HumanEgo describes ICT as a viewpoint- and embodiment-invariant encoding. In practice, it represents each hand or object as a 6DoF entity, and also represents the hand pose relative to that entity.

A simplified view looks like this:

```text
[right hand token]
[bread token]
[plate token]
[padding]
[padding]
...
```

In the single-hand setting, each token has 20 dimensions, which matches the `(8, 20)` shape of `x_ict` in the dataloader output below.

```text
[type_id]                  1 dimension
[pose_in_ref]              9 dimensions
  position xyz             3 dimensions
  rotation rot6d           6 dimensions
[hand_in_entity]           9 dimensions
[flag]                     1 dimension
-------------------------------
total                      20 dimensions
```
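To see how a 4x4 pose could end up as part of such a token, here is a hedged sketch of the packing. The function names are hypothetical, and the `type_id` and `flag` values are placeholders whose meaning I have not verified. For `rot6d` I flatten the first two columns of the rotation matrix row by row, so that the identity rotation becomes `[1, 0, 0, 1, 0, 0]`. That matches the near-identity values in the dataloader output shown in this article, but it is my reading of that output rather than a confirmed convention.

```python
import numpy as np


def pose_to_9d(T):
    """Pack a 4x4 pose into [x, y, z] + 6D rotation (9 values)."""
    xyz = T[:3, 3]
    # Assumed layout: [R00, R01, R10, R11, R20, R21]
    rot6d = T[:3, :2].reshape(-1)
    return np.concatenate([xyz, rot6d])


def make_token(type_id, T_entity_in_ref, T_hand_in_entity, flag):
    """Build one 20-dimensional token: 1 + 9 + 9 + 1."""
    return np.concatenate(
        [[type_id], pose_to_9d(T_entity_in_ref), pose_to_9d(T_hand_in_entity), [flag]]
    ).astype(np.float32)


def pad_tokens(tokens, max_ict=8):
    """Stack tokens into a fixed-size array plus a validity mask."""
    x_ict = np.zeros((max_ict, 20), dtype=np.float32)
    mask = np.zeros(max_ict, dtype=bool)
    for i, tok in enumerate(tokens[:max_ict]):
        x_ict[i] = tok
        mask[i] = True
    return x_ict, mask


# Placeholder poses and ids, for shape illustration only.
T_hand_in_ref = np.eye(4)
T_hand_in_ref[:3, 3] = [0.17, -0.03, -0.08]
hand_token = make_token(2.0, T_hand_in_ref, np.eye(4), 0.0)
ref_obj_token = make_token(3.0, np.eye(4), T_hand_in_ref, -1.0)

x_ict, ict_mask = pad_tokens([hand_token, ref_obj_token])
print(x_ict.shape, ict_mask.tolist())
```

Padding to a fixed `max_ict` lets scenes with different numbers of objects share one tensor shape, with the mask telling the model which rows are real.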

Let’s generate ICT using the dataloader and inspect it.

```bash
python - <<'PY'
from training.FlowMatchingDataloader import FlowMatchingDataloader, MPSSessions

mps = "./data/serve_bread/aria/mps_serve_bread_000_vrs"

ds = FlowMatchingDataloader(
    sessions=[MPSSessions(mps_path=mps)],
    single_hand=True,
    single_hand_side="right",
    max_ict=8,
    img_name="rgb_WoArm_WArmObjKpts.png",
    use_legacy_image_loading=False,
    enable_augmentation=False,
)

sample = ds[0]

print("json_path:", sample["json_path"])
print("x_rgb shape:", tuple(sample["x_rgb"].shape))
print("x_ict shape:", tuple(sample["x_ict"].shape))
print("ict_mask:", sample["ict_mask"].tolist())
print("y_action shape:", tuple(sample["y_action"].shape))

for i, valid in enumerate(sample["ict_mask"]):
    if not bool(valid):
        continue

    tok = sample["x_ict"][i]
    print(f"\nToken {i}")
    print("  type_id              :", tok[0].item())
    print("  pose_in_ref xyz      :", tok[1:4].tolist())
    print("  pose_in_ref rot6d    :", tok[4:10].tolist())
    print("  hand_in_this xyz     :", tok[10:13].tolist())
    print("  hand_in_this rot6d   :", tok[13:19].tolist())
    print("  flag                 :", tok[19].item())
PY
```

Example output:

```text
[FlowMatchingDataloader] Built 864 samples.
  - Hand Method  : aria_mps (key='hands')
  - Image Input  : rgb_WoArm_WArmObjKpts.png
  - Frame Mode   : ANCHOR_FRAME
  - Action Mode  : ABSOLUTE
  - Aux  : ObjDynamics=True | VisForesight=True | TempContrastive=True
  - 3D PCD Feats : True
json_path: ./data/serve_bread/aria/mps_serve_bread_000_vrs/preprocess/all_data/00128/training_data.json
x_rgb shape: (3, 240, 320)
x_ict shape: (8, 20)
ict_mask: [True, True, True, False, False, False, False, False]
y_action shape: (50, 10)

Token 0
  type_id              : 2.0
  pose_in_ref xyz      : [0.16784971952438354, -0.02881561778485775, -0.07735089212656021]
  pose_in_ref rot6d    : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512915223836899]
  hand_in_this xyz     : [1.1102230246251565e-16, 3.8163916471489756e-17, 0.0]
  hand_in_this rot6d   : [1.0, -5.938278001447962e-17, 5.938278001447962e-17, 1.0, 6.2144218264730725e-19, 2.566785074529357e-18]
  flag                 : 0.0

Token 1
  type_id              : 3.0
  pose_in_ref xyz      : [-2.2069588823114827e-08, 7.90906717895723e-09, 2.1668267180530165e-08]
  pose_in_ref rot6d    : [1.0, 3.951016847025812e-09, -3.951016847025812e-09, 1.0, 1.3024523681792743e-08, -8.553106667363863e-09]
  hand_in_this xyz     : [0.16784973442554474, -0.028815625235438347, -0.0773509219288826]
  hand_in_this rot6d   : [0.2640394866466522, -0.9627724289894104, -0.9046151041984558, -0.26802101731300354, 0.3345962166786194, 0.03512916341423988]
  flag                 : -1.0

Token 2
  type_id              : 4.0
  pose_in_ref xyz      : [-0.24469204246997833, -0.06749166548252106, 0.01143864169716835]
  pose_in_ref rot6d    : [0.962512731552124, 0.03149929270148277, 0.2607153654098511, 0.1665089875459671, -0.07481195032596588, 0.9855366945266724]
  hand_in_this xyz     : [0.4138026237487793, -0.0680706650018692, 0.060859549790620804]
  hand_in_this rot6d   : [-0.00673748878762126, -0.9991859197616577, 0.18744736909866333, -0.040333494544029236, 0.9822515845298767, 0.0008433718467131257]
  flag                 : -1.0
```

One thing to note here: `x_ict` is not a file name. It is the name of a PyTorch tensor passed into the model.
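As a sanity check on the printed values, the six rotation numbers can be turned back into a full 3x3 rotation matrix with Gram-Schmidt orthogonalization, which is the usual way to decode a 6D rotation. The sketch below assumes the six values are the first two matrix columns flattened row by row. That is my interpretation of the output above, not something I confirmed in the repository. The input is the `pose_in_ref rot6d` of Token 0 copied from the example output.

```python
import numpy as np


def rot6d_to_matrix(r6):
    """Decode a 6D rotation into a 3x3 rotation matrix via Gram-Schmidt.

    Assumed layout: [R00, R01, R10, R11, R20, R21] (two columns, row-major).
    """
    m = np.asarray(r6, dtype=float).reshape(3, 2)
    a1, a2 = m[:, 0], m[:, 1]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1  # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)          # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)


# Token 0, pose_in_ref rot6d, copied from the example output above.
token0_rot6d = [
    0.2640394866466522, -0.9627724289894104,
    -0.9046151041984558, -0.26802101731300354,
    0.3345962166786194, 0.03512915223836899,
]

R = rot6d_to_matrix(token0_rot6d)
print(np.round(R, 4))
print("det:", round(float(np.linalg.det(R)), 6))
```

Under this layout the two decoded columns come out already unit-length and orthogonal, so Gram-Schmidt barely changes them. That is consistent with the assumed ordering.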

* * *

## Predicting a Future 50-Step Trajectory

During training, HumanEgo learns to predict a short future 6DoF hand motion from a first-person image and ICT.

In the default configuration, `pred_horizon` is 50.

This means the model predicts 50 future steps. If the Aria RGB stream is 30 FPS, this roughly corresponds to:

```text
50 frames / 30 fps ≈ 1.67 seconds
```

However, the actual time span can change if you modify the stride or the control frequency during real robot inference.
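The arithmetic is simple enough to wrap in a helper. In this sketch I assume that `stride` means the number of source frames between two consecutive predicted steps. That interpretation, and the function name `horizon_seconds`, are mine, so check the configuration before relying on the numbers.

```python
def horizon_seconds(pred_horizon, fps, stride=1):
    """Time span covered by the predicted trajectory, in seconds.

    Assumption: consecutive predicted steps are `stride` source frames apart.
    """
    return pred_horizon * stride / fps


print(round(horizon_seconds(50, 30.0), 2))            # 1.67
print(round(horizon_seconds(50, 30.0, stride=2), 2))  # 3.33
```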

The model does not predict the detailed positions of the two virtual gripper fingers.

Instead, for each future step, it predicts:

```text
end-effector position
end-effector orientation
grasp value
```

In the single-hand setting, each step has 10 dimensions:

```text
[x, y, z, r1, r2, r3, r4, r5, r6, grasp]
```

So a 50-step trajectory becomes:

```text
50 × 10
```
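When inspecting `y_action` or a predicted trajectory, it is handy to split the 10 columns back into their parts. This sketch follows the column order listed above. The helper name `split_action` is hypothetical, and the input is a dummy array rather than real data.

```python
import numpy as np


def split_action(y_action):
    """Split a (T, 10) trajectory into position, 6D rotation, and grasp."""
    y = np.asarray(y_action)
    if y.ndim != 2 or y.shape[1] != 10:
        raise ValueError(f"expected shape (T, 10), got {y.shape}")
    position = y[:, 0:3]  # x, y, z
    rot6d = y[:, 3:9]     # r1 .. r6
    grasp = y[:, 9]       # one grasp value per step
    return position, rot6d, grasp


dummy = np.zeros((50, 10), dtype=np.float32)  # stand-in for sample["y_action"]
pos, rot6d, grasp = split_action(dummy)
print(pos.shape, rot6d.shape, grasp.shape)
```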

* * *

## Training

Training can be started with:

```bash
python -m training.FlowMatchingTrainer \
  --task serve_bread \
  --use_cfg \
  --job HumanEgo \
  2>&1 | tee blog_logs/train_400.log
```

In the default configuration:

```text
epochs       = 400
batch_size   = 32
pred_horizon = 50
```

At a high level, training does this:

```text
current image
+
current ICT
  ↓
learn to generate the future 50-step end-effector trajectory
```

It is important to clarify what is being trained here.

This training step does **not** retrain SAM2, Grounding DINO, LaMa, or CoTracker.

Those models are used during preprocessing.

The model trained by HumanEgo is the policy that uses:

```text
processed images
hand and object poses
training_data.json
ICT
future hand trajectories
```

to generate a future end-effector trajectory from the current observation.

In other words, HumanEgo trains a trajectory generation policy. This policy tells the robot where the end-effector should move next.

HumanEgo uses **Flow Matching** as the learning method for this trajectory generation policy.

* * *

## What Is Flow Matching?

Flow Matching is a general generative modeling method.

It learns a vector field that moves samples from a noise distribution toward the data distribution.

In HumanEgo, this idea is applied to future robot trajectory generation.

During training, we first have the ground-truth future trajectory from the human demonstration:

```text
x1 = ground-truth future 50-step trajectory
```

Then we sample random noise with the same shape:

```text
x0 = random trajectory
```

Next, we create an intermediate trajectory:

```text
xt = (1 - t) x0 + t x1
```

The model is conditioned on the current image and ICT, and learns:

```text
Given this intermediate trajectory xt,
which direction should it move to become closer to the ground-truth trajectory x1?
```

Intuitively, the model repeatedly practices the following task:

```text
random future trajectory
  ↓
predict how to move it toward a human-demonstration-like trajectory
```
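The recipe above can be written down in a few lines of NumPy. This is a generic flow matching sketch, not HumanEgo's trainer. I use the common regression target `x1 - x0`, which is the velocity of the straight path `xt`, and I leave out the conditioning on image and ICT. To show how sampling works without a trained network, an "oracle" velocity field that already knows `x1` stands in for the model. All function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_training_pair(x1, rng):
    """Build one flow matching training example for a ground-truth trajectory x1."""
    x0 = rng.standard_normal(x1.shape)  # random noise trajectory
    t = rng.uniform(0.0, 1.0)           # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the straight path
    target_velocity = x1 - x0           # d(xt)/dt along that path
    return xt, t, target_velocity


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


x1 = rng.standard_normal((50, 10))  # stand-in for a 50-step, 10-dim trajectory
xt, t, v = make_training_pair(x1, rng)
# A network would be trained so that model(xt, t, image, ict) is close to v.


def oracle_velocity(x, t):
    """Stand-in for a trained network: points straight at the known x1."""
    return (x1 - x) / (1.0 - t)


sample = euler_sample(oracle_velocity, rng.standard_normal(x1.shape))
print("max abs error:", float(np.abs(sample - x1).max()))
```

At inference time the same integration loop runs with the learned network in place of the oracle, starting from fresh noise each time.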

Treating action sequences as generative modeling targets has become increasingly common in robot learning.

The unique part of HumanEgo is not just Flow Matching itself.

The key idea is how HumanEgo converts human first-person videos into:

```text
arm-removed images
virtual gripper images
ICT
future end-effector trajectories
```

and then uses them for robot policy learning.

* * *

## What Is Produced After Training?

After training, we do not immediately run the policy on a real robot.

The first thing to inspect is the **offline evaluation**.

Offline evaluation uses a held-out recording. In the standard public sample setup, recording `000` is held out for evaluation, while the remaining recordings are used for training.

Conceptually, offline evaluation works like this:

```text
current frame from evaluation data
  ├── processed image
  └── ICT
        ↓
trained Flow Matching policy
        ↓
predicted future 50-step end-effector trajectory
        ↓
compare with the ground-truth future trajectory
```

The important point is that this is **not a real robot success rate**.

In offline evaluation, the image, ICT, and ground-truth trajectory are already available in the evaluation dataset.

This does not include many real-world robot problems, such as:

```text
the robot pushing the object accidentally
the gripper failing to close correctly
camera-robot calibration error
object detection failure in a real environment
robot control latency
contact and slipping
```

So offline evaluation only measures how well the learned policy predicts trajectories on held-out data.

* * *

## Understanding the Output Files

After training, results are saved under:

```text
runs/serve_bread/HumanEgo/
```

The run directory contains files such as:

```text
latest.pt
dataset_stats.json
config.json
train_curve.png
eval_curve.png
eval_snapshots/eval_ep_*.json
eval_render/epoch_*/
```

If we just look at this as a file list, it can be hard to understand what each file means.

A more useful way is to group them by role.

```text
A. Trained policy weights
   latest.pt

B. Information needed to reproduce or use the policy
   config.json
   dataset_stats.json

C. Information for checking training progress
   train_curve.png

D. Numerical results from offline evaluation
   eval_curve.png
   eval_snapshots/eval_ep_*.json

E. Visual results from offline evaluation
   eval_render/epoch_*/
```

The relationship between these files is:

```text
training_data.json
        ↓
training
        ↓
latest.pt is created
        ↓
apply latest.pt to evaluation data
        ↓
compare predicted trajectories with ground-truth trajectories
        ↓
eval_curve.png / eval_snapshots / eval_render are created
```

So `eval_curve.png` and `eval_render/` are not just extra files.

They are the outputs that help us check how well `latest.pt` predicts future trajectories on evaluation data.

* * *

## Inference

The trained `latest.pt` can be used as policy weights for inference.

However, there are two different meanings of inference here.

The first one is **offline inference**.

```text
existing evaluation data
  ↓
generate future trajectories
  ↓
compare them with ground truth
```

The second one is **real robot inference**.

```text
real camera
+
real robot
+
perception
+
calibration
+
controller
  ↓
generate future end-effector trajectory
  ↓
move the robot
```

This article focuses on the first one: offline inference.

The offline inference flow is:

```text
evaluation training_data.json
evaluation image
        ↓
dataloader generates ICT
        ↓
policy loaded from latest.pt
        ↓
generate future 50-step end-effector trajectory
        ↓
compare with ground truth
```

The results appear as:

```text
eval_curve.png
eval_snapshots/
eval_render/
```

* * *

## How to Read Offline Evaluation Results

When checking offline evaluation, I usually look at the outputs in this order:

```text
1. train_curve.png
      Did training itself proceed normally?

2. eval_curve.png
      Did the evaluation error decrease?

3. eval_snapshots/eval_ep_*.json
      What are the final quantitative metrics?

4. eval_render/epoch_*/
      Do the predicted trajectories look natural?
```

`train_curve.png` and `eval_curve.png` look like this:

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/b4051879-b780-4284-84ae-9e58c41058cb.png align="center")

![](https://cdn.hashnode.com/uploads/covers/69ce85bb0ff860b6def2c2ab/0fecaee6-167c-44e5-80e0-416b22daa8ee.png align="center")

For example, the final evaluation video may be located at:

`runs/serve_bread/HumanEgo/eval_render/epoch_0400/evaluation_vis.mp4`

%[https://youtube.com/shorts/cVYoChnYpHg?feature=share] 

The main metrics from:

```text
runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json
```

look like this:

```text
num snapshots: 400
latest: runs/serve_bread/HumanEgo/eval_snapshots/eval_ep_0400.json
epoch: 400
frames: 864
done_acc: 0.9085648148148148
pos_err_k1_m: 0.03651409354750757
pos_err_kK_m: 0.037472219930754766
rot_err_k1_deg: 5.2715283058307785
rot_err_kK_deg: 7.541413042280409
pos_err_w_m: 0.037069905003556954
rot_err_w_deg: 6.384099839530582
grasp_f1_k1: 0.8659217877094972
grasp_f1_kK: 0.8555478018143754
grasp_f1_w: 0.8580602059355177
zero_state_ratio: 0.0
```
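To build some intuition for the position and rotation numbers, here is how such errors are commonly computed: Euclidean distance in meters for position, and the geodesic angle between two rotation matrices in degrees for orientation. This is a sketch of the standard definitions, consistent with the `_m` and `_deg` suffixes. I have not confirmed how the repository computes these metrics or how the `k1`, `kK`, and `w` variants are defined.

```python
import numpy as np


def position_error_m(p_pred, p_gt):
    """Euclidean distance between predicted and ground-truth positions (meters)."""
    return float(np.linalg.norm(np.asarray(p_pred, dtype=float) - np.asarray(p_gt, dtype=float)))


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    R_rel = np.asarray(R_pred).T @ np.asarray(R_gt)
    # trace(R) = 1 + 2 cos(theta); clip to guard against rounding outside [-1, 1]
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))


# Toy values for illustration (not taken from the evaluation run).
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print("pos err [m]  :", round(position_error_m([0, 0, 0], [0.03, 0.0, 0.04]), 4))
print("rot err [deg]:", round(rotation_error_deg(np.eye(3), Rz90), 4))
```

Read this way, a position error of about 0.037 m means the predicted end-effector position is, on average, a few centimeters away from the human demonstration.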

Again, we need to be careful here.

Offline evaluation is **not** the same as real robot success rate.

Offline evaluation checks:

```text
current image + current ICT
        ↓
predicted future trajectory
        ↓
comparison with ground-truth future trajectory
```

But in a real robot setup, many additional factors appear:

```text
object pose estimation
camera-robot calibration
robot control
contact dynamics
slipping
grasp failure
latency
```

So even if offline evaluation looks good, it does not guarantee real-world robot success.

However, if offline evaluation is bad, it is a strong signal that we should check the training data, preprocessing results, ICT, and configuration before moving to a real robot.

* * *

## What I Learned from This Experiment

After running HumanEgo on the `serve_bread` sample, the overall structure became much clearer to me.

Here is the big picture:

```text
preprocess
  Converts human videos into a format suitable for robot learning

training_data.json
  Stores hand poses, object poses, camera poses, grasp states, and image paths

ICT
  Numerical tokens generated from training_data.json during training and inference

training
  Learns a policy that generates future end-effector trajectories from current images and ICT

latest.pt
  Trained policy weights

inference
  Generates future end-effector trajectories from clean images and ICT,
  then passes them to a robot controller
```

The most important takeaway is that HumanEgo is not simply “training on human videos.”

It first transforms human egocentric demonstrations into a robot-friendly representation:

```text
remove human appearance
represent interaction in 3D
encode hands and objects as ICT
learn future end-effector trajectories
```

This makes it easier to understand why preprocessing is so central to the method.

For me, the `serve_bread` example was useful because it showed the entire pipeline end to end:

```text
raw human video
  ↓
preprocessing outputs
  ↓
training_data.json
  ↓
ICT
  ↓
Flow Matching policy training
  ↓
offline trajectory evaluation
```

Before trying real robot execution, I think this offline pipeline is the right place to start.

It helps answer a very practical question:

```text
Can the model predict reasonable future trajectories from the processed human demonstration data?
```

Once that part looks reliable, the next challenge is connecting the learned policy to real robot perception, calibration, and control.