OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

OMG learns a generalist motion generator that converts diverse human intent signals into trackable whole-body trajectories for Unitree G1 control.

Tsinghua University

* Equal contribution   Corresponding author

OMG overview showing OMG-Data, OMG-DiT, motion tracking, real G1 execution, and omni-modal control.

Teaser Video

Omni-modal control in real time.

OMG turns diverse conditioning modalities into executable Unitree G1 whole-body motions through a generator-tracker hierarchy.

Overview

Toward Generalist Humanoid Control through Motion Generation

Humanoid whole-body control has made rapid progress, but RL-based methods remain tied to narrow skill policies and heavy reward engineering, while motion-tracking methods still require a reference motion at inference time. What is missing is a layer between intent and control: a motion generator that translates high-level, multi-modal intent into future robot motion before a low-level controller executes it.

The system follows a generator-tracker hierarchy. OMG-DiT predicts future trajectories from language, audio, human reference motion, motion history, and compositions of these conditions in real time; a pretrained motion tracker then converts those references into physically executable Unitree G1 trajectories.
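The generator-tracker hierarchy described above can be pictured as a two-level loop. The following is a minimal sketch under stated assumptions, not the OMG implementation: `run_hierarchy`, the flat `Frame` layout, the chunk length, and both toy stand-ins are hypothetical names and values chosen only to make the control flow runnable.

```python
from typing import Callable, Dict, List

Frame = List[float]            # one whole-body pose as a flat vector (assumed layout)
Conditions = Dict[str, object]  # e.g. {"text": "walk"}; keys are illustrative


def run_hierarchy(
    generate: Callable[[List[Frame], Conditions], List[Frame]],
    track: Callable[[Frame, Frame], Frame],
    initial_state: Frame,
    conditions: Conditions,
    num_chunks: int,
    history_len: int = 4,
) -> List[Frame]:
    """Alternate between generating a reference chunk and tracking it."""
    state = initial_state
    history: List[Frame] = [initial_state]
    executed: List[Frame] = []
    for _ in range(num_chunks):
        # High level: predict a short future reference from history and intent.
        reference = generate(history[-history_len:], conditions)
        for target in reference:
            # Low level: the tracker moves the robot state toward each target.
            state = track(state, target)
            executed.append(state)
            history.append(state)
    return executed


# Toy stand-ins (hypothetical): the "generator" advances every joint by 0.1
# per frame, and the "tracker" closes half of the gap to the target.
def toy_generate(history: List[Frame], conditions: Conditions) -> List[Frame]:
    last = history[-1]
    return [[q + 0.1 * (k + 1) for q in last] for k in range(3)]


def toy_track(state: Frame, target: Frame) -> Frame:
    return [s + 0.5 * (t - s) for s, t in zip(state, target)]


trajectory = run_hierarchy(
    toy_generate, toy_track, [0.0, 0.0], {"text": "walk"}, num_chunks=2
)
```

The point of the sketch is the separation of concerns: the generator only sees past motion and intent signals, while the tracker only sees the current state and the next reference frame.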

This hierarchy is powered by OMG-Data, a 1174.66-hour omni-modal humanoid motion corpus built by retargeting, filtering, annotating, and aligning publicly available data into the Unitree G1 motion space. New control interfaces can be added through lightweight encoders while reusing the pretrained motion prior.
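One way to read "lightweight encoders while reusing the pretrained motion prior" is as a registry of per-modality encoders that all emit tokens of a fixed width, so the backbone never changes when a modality is added. This is an illustrative sketch only: `EMBED_DIM`, `register_encoder`, `encode_conditions`, and the toy encoders are hypothetical and are not taken from the OMG codebase.

```python
from typing import Callable, Dict, List

EMBED_DIM = 4  # hypothetical token width expected by the shared backbone

Encoder = Callable[[object], List[float]]
ENCODERS: Dict[str, Encoder] = {}


def register_encoder(name: str, encoder: Encoder) -> None:
    """Add a control modality without touching the backbone."""
    ENCODERS[name] = encoder


def encode_conditions(conditions: Dict[str, object]) -> List[List[float]]:
    """Turn each provided condition into one fixed-width token."""
    tokens = []
    for name, value in conditions.items():
        token = ENCODERS[name](value)
        if len(token) != EMBED_DIM:
            raise ValueError(f"{name} encoder must emit {EMBED_DIM} values")
        tokens.append(token)
    return tokens


# Toy encoders (assumptions): real encoders would be learned networks.
register_encoder("text", lambda s: [float(len(str(s))), 0.0, 0.0, 0.0])
# A new interface added later, e.g. a planar velocity command:
register_encoder("velocity", lambda v: [0.0, float(v[0]), float(v[1]), 0.0])

tokens = encode_conditions({"text": "wave", "velocity": (0.5, -0.2)})
```

Because every encoder targets the same token width, any subset of conditions can be passed together, which is also what makes composing conditions straightforward in this sketch.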

Contributions

01

OMG framework

Unifies diverse human intent signals under a generator-tracker hierarchy that maps conditions into physically executable robot trajectories.

02

OMG-Data

Curates a large-scale omni-modal humanoid corpus through retargeting, filtering, annotation, and alignment into the Unitree G1 motion space.

03

OMG-DiT

Introduces a diffusion-based backbone with extensible and compositional conditioning on language, audio, and human reference motions.

04

Foundation Model Behaviors

Demonstrates state-of-the-art omni-modal control, model scaling, sample-efficient adaptation, and zero-shot composition of control signals.

OMG-Data

1174.66 hours of robot-executable omni-modal motion data.

OMG-Data aggregates heterogeneous graphics and humanoid datasets, validates and segments the raw clips, retargets motions into the Unitree G1 embodiment, and filters physically invalid trajectories through simulation-in-the-loop screening.
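The four stages named above (validate, segment, retarget, filter) compose naturally into a single pass over the raw clips. The sketch below only illustrates that ordering; `build_corpus` and every toy stage (window size, joint limit, per-frame step threshold) are assumptions made for the example and do not describe the actual OMG-Data tooling or its simulator.

```python
from typing import Callable, Iterable, List

Clip = List[List[float]]  # frames x joint values (assumed layout)


def build_corpus(
    raw_clips: Iterable[Clip],
    is_valid: Callable[[Clip], bool],
    segment: Callable[[Clip], List[Clip]],
    retarget: Callable[[Clip], Clip],
    passes_simulation: Callable[[Clip], bool],
) -> List[Clip]:
    """Validate, segment, retarget, then keep only physically plausible motion."""
    corpus: List[Clip] = []
    for clip in raw_clips:
        if not is_valid(clip):
            continue
        for piece in segment(clip):
            robot_motion = retarget(piece)
            if passes_simulation(robot_motion):
                corpus.append(robot_motion)
    return corpus


# Toy stages (hypothetical): real stages would run a retargeting solver and
# a physics simulator rather than these simple checks.
def toy_is_valid(clip: Clip) -> bool:
    return len(clip) > 0


def toy_segment(clip: Clip, window: int = 2) -> List[Clip]:
    return [clip[i:i + window] for i in range(0, len(clip), window)]


def toy_retarget(piece: Clip, limit: float = 1.0) -> Clip:
    # Clamp each joint value to an assumed joint limit.
    return [[max(-limit, min(limit, q)) for q in frame] for frame in piece]


def toy_passes_simulation(motion: Clip, max_step: float = 0.5) -> bool:
    # Reject motions whose joints jump too far between consecutive frames.
    return all(
        abs(b - a) <= max_step
        for prev, cur in zip(motion, motion[1:])
        for a, b in zip(prev, cur)
    )


raw = [[[0.0], [0.2], [0.4], [2.0]], []]
corpus = build_corpus(
    raw, toy_is_valid, toy_segment, toy_retarget, toy_passes_simulation
)
```

In this toy run the empty clip is dropped at validation, and the second segment is rejected by the screening step because its clamped motion still jumps too far between frames.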

1174.66h Total processed data
1166.6h Text-labeled motion
958.77h Human reference motion
191.6h Audio-conditioned motion
Dataset statistics for OMG-Data across text, audio, and human reference modalities.
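As a quick reading aid for the figures above, the per-modality hours can be expressed as shares of the 1174.66-hour total. The snippet uses only the numbers stated on this page; the dictionary keys are illustrative names.

```python
TOTAL_HOURS = 1174.66
MODALITY_HOURS = {
    "text": 1166.6,
    "human_reference": 958.77,
    "audio": 191.6,
}

# Share of the full corpus covered by each modality.
shares = {name: hours / TOTAL_HOURS for name, hours in MODALITY_HOURS.items()}

# Hours counted under more than one modality, summed over the extra labels.
excess_hours = sum(MODALITY_HOURS.values()) - TOTAL_HOURS

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"excess label hours: {excess_hours:.2f}")
```

This works out to roughly 99.3% text-labeled, 81.6% with human reference motion, and 16.3% audio-conditioned. Since the three figures sum to more than the total, a large part of the corpus must carry more than one kind of annotation.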
OMG-DiT architecture with shared diffusion backbone and modality-specific condition encoders.

OMG-DiT

One diffusion backbone, many control modalities.

OMG-DiT decouples the motion prior from the conditioning modality. A shared denoising transformer models feasible Unitree G1 motion, while language, music, and human reference motion can all condition the generator through modality-specific encoders, cross-attention, FiLM adapters, and classifier-free guidance.
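Two of the mechanisms named above have compact standard forms. FiLM applies a per-feature scale and shift predicted from the condition, and classifier-free guidance extrapolates from an unconditional prediction toward a conditional one. The sketch below shows those generic formulas on plain lists; how OMG-DiT wires them into its layers, and which modality uses which mechanism, is not specified here, and the function names and the guidance scale are assumptions.

```python
from typing import List

Vector = List[float]


def film(features: Vector, gamma: Vector, beta: Vector) -> Vector:
    """FiLM: feature-wise affine modulation, gamma * x + beta."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def classifier_free_guidance(uncond: Vector, cond: Vector, scale: float) -> Vector:
    """Guided prediction: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# In a real model, gamma and beta would come from a condition encoder and the
# two predictions from running the denoiser with and without the condition.
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=2.0)
```

With a scale of 1 the guided output equals the conditional prediction; larger scales push the sample further toward the condition, typically at some cost in diversity.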

Language conditioning

Motion history and language instructions enter the shared motion prior through global context tokens.

Music conditioning

Frame-aligned music and audio cues modulate timing, rhythm, and style throughout denoising.

Human reference motion

Reference trajectories condition the generator with per-frame motion guidance for whole-body control.

Results Highlights

Demo videos by control signal.

OMG turns different conditioning modalities into executable Unitree G1 whole-body motion. The composition demo is a single continuous take in which the robot switches between control signals and modalities in real time.

Citation

Reference OMG.

If you find this work useful, consider citing:

TBD