Embodied Reinforcement Learning

UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms

UniLab decouples CPU simulation from GPU learning and connects them through a shared-memory runtime, so simulators and learners stop waiting on each other. The result is an end-to-end pipeline that runs unchanged on Apple Silicon (MPS / MLX), a first-class target where prior training stacks force a Linux/CUDA detour, as well as on NVIDIA, AMD, and Intel hardware, and that reaches policy targets 3–10× faster than tightly coupled GPU-centric baselines.

End-to-end speedup: 3–10×
Physics backends: 2 (MotrixSim · MuJoCo)
Platforms: 4 (macOS · CUDA · ROCm · Intel Arc)
Robot categories: 5 (arms · quadrupeds · humanoids · hands · wheeled-leg)
Algorithms: 7 (PPO · APPO · SAC · TD3 · FlashSAC · HORA · HIM-PPO)
Task families: 7 (walk · parkour · dance · flip · skate · loco-manip · dex-manip)
Tasks shipped: 14

Contributions

What UniLab brings to embodied RL

01. CPU sim, GPU learn: no waiting

Independent CPU rollout workers stream transitions through a lock-free shared-memory buffer to a GPU learner. Asynchronous weight sync removes the simulator–learner stall that dominates tightly coupled pipelines.
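To make the worker-to-learner path concrete, here is a minimal sketch of a single-producer / single-consumer ring buffer in shared memory. It is not UniLab's implementation: the class name `TransitionRing`, the fixed-size float32 record layout, and the two-counter header are all assumptions made for illustration. Python gives no memory-ordering guarantees, so treat this as a picture of the idea (neither side ever takes a lock or blocks) rather than a production queue.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<QQ")  # (write counter, read counter), both monotonically increasing


class TransitionRing:
    """Hypothetical SPSC ring of fixed-size float32 records in shared memory."""

    def __init__(self, capacity, record_floats, name=None):
        self.capacity = capacity
        self.record = struct.Struct(f"<{record_floats}f")
        self.owner = name is None  # the creator owns (and later unlinks) the block
        size = HEADER.size + capacity * self.record.size
        self.shm = shared_memory.SharedMemory(name=name, create=self.owner, size=size)
        if self.owner:
            HEADER.pack_into(self.shm.buf, 0, 0, 0)

    def _slot(self, counter):
        return HEADER.size + (counter % self.capacity) * self.record.size

    def push(self, values):
        """Worker side. Returns False instead of blocking when the ring is full."""
        write, read = HEADER.unpack_from(self.shm.buf, 0)
        if write - read == self.capacity:
            return False
        self.record.pack_into(self.shm.buf, self._slot(write), *values)
        struct.pack_into("<Q", self.shm.buf, 0, write + 1)  # publish after the payload
        return True

    def pop(self):
        """Learner side. Returns None instead of blocking when the ring is empty."""
        write, read = HEADER.unpack_from(self.shm.buf, 0)
        if read == write:
            return None
        values = self.record.unpack_from(self.shm.buf, self._slot(read))
        struct.pack_into("<Q", self.shm.buf, 8, read + 1)  # release the slot
        return values

    def close(self):
        self.shm.close()
        if self.owner:
            self.shm.unlink()
```

Only the worker advances the write counter and only the learner advances the read counter, which is what lets each side proceed without waiting on the other. A second process would attach to the same block by passing the creator's `shm.name` as `name`.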

02. macOS as a first-class target

Train end-to-end on Apple Silicon (MPS / MLX), with no Linux box required; the same stack also runs on NVIDIA (CUDA), AMD (ROCm), and Intel (Arc / XPU). Match your hardware, not the framework's assumptions.
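As an illustration of what "match your hardware" can look like on the learner side, the sketch below probes for an accelerator with PyTorch. It assumes a PyTorch learner, which the "(torch)" labels in the task gallery suggest; the function name `pick_device` and the probe order are assumptions, and MLX, being a separate framework, is not covered here.

```python
def pick_device():
    """Return a device string for whichever accelerator this machine exposes."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe

    if torch.cuda.is_available():
        # ROCm builds of PyTorch also report through torch.cuda;
        # torch.version.hip is set on those builds.
        return "cuda"

    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon

    xpu = getattr(torch, "xpu", None)  # only present in recent PyTorch releases
    if xpu is not None and xpu.is_available():
        return "xpu"  # Intel GPUs

    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The `getattr` guards keep the probe usable on older PyTorch releases that lack one of the backends.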

03. Broad algorithm coverage

On-policy (PPO, APPO, HIM-PPO), off-policy (SAC, TD3, FlashSAC), and distillation (HORA) methods all run on the same heterogeneous runtime, including asynchronous PPO adapted for decoupled execution.

UniLab system architecture: a unified RL workflow assembly feeds CPU-side batched physics backends (MuJoCoUni / MotrixSim); a heterogeneous simulation-and-learning core (CPU workers, host IPC, GPU device learner) coordinates data, scheduling, and parameter sync; and the trained policies drive real-robot applications.
UniLab system architecture. One pipeline delivers all three contributions: CPU workers feed a host shared-memory buffer that a GPU learner consumes (decoupled CPU-sim / GPU-learn); the same runtime hosts PPO · APPO · FastSAC · FlashSAC; and the whole stack runs across CUDA, Apple, AMD, and Intel.

Architecture

How the pipeline stays busy

UniLab is built around two ideas: spatial decoupling of simulation and learning across heterogeneous devices, and temporal overlap of trajectory collection, gradient steps, and weight sync.

Collection–update timing and overlap. One-cycle pipeline timeline: CPU collection and GPU learning overlap in wall-clock time, with parameter synchronization near rollout boundaries, so the learner is not stalled behind rollout generation.
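A back-of-the-envelope model helps show why overlap matters. The sketch below treats training as a two-stage pipeline (collect, then update plus sync) and compares running the stages back to back with running them concurrently. The function name and all timings are made-up illustrations, not UniLab measurements.

```python
def wall_clock(cycles, collect_s, learn_s, sync_s, overlapped):
    """Total wall-clock seconds for `cycles` collect/update cycles."""
    if not overlapped:
        # Tightly coupled: every cycle pays for collection, update, and sync in sequence.
        return cycles * (collect_s + learn_s + sync_s)
    # Overlapped: once the first rollout fills the pipeline, each further cycle
    # costs only the slower stage; the final update drains at the end.
    steady = max(collect_s, learn_s + sync_s)
    return collect_s + (cycles - 1) * steady + learn_s + sync_s


if __name__ == "__main__":
    coupled = wall_clock(100, collect_s=2.0, learn_s=1.5, sync_s=0.5, overlapped=False)
    overlap = wall_clock(100, collect_s=2.0, learn_s=1.5, sync_s=0.5, overlapped=True)
    print(coupled, overlap)  # 400.0 202.0
```

In this model overlap alone approaches a 2× gain when the two stages are balanced, and the faster stage sits idle whenever they are not. The price is that the learner may consume rollouts produced by slightly older weights.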

Algorithm runtimes

PPO and APPO use trajectory-oriented runners. Each algorithm declares its sync requirement and buffer usage; the runtime schedules accordingly.
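One way to picture "declare, then schedule" is a small declarative spec per algorithm. Everything in this sketch is hypothetical: the `Sync` and `Buffer` enums, the `AlgoSpec` fields, and in particular which mode each algorithm is assigned are assumptions for illustration, not UniLab's actual declarations.

```python
from dataclasses import dataclass
from enum import Enum


class Sync(Enum):
    BARRIER = "barrier"  # learner waits for a full rollout from the current policy
    ASYNC = "async"      # workers keep sampling with slightly stale weights


class Buffer(Enum):
    ROLLOUT = "rollout"  # trajectories consumed once, then discarded
    REPLAY = "replay"    # transitions retained and resampled


@dataclass(frozen=True)
class AlgoSpec:
    name: str
    sync: Sync
    buffer: Buffer


# Assumed assignments, chosen only to exercise each combination.
SPECS = {
    "PPO": AlgoSpec("PPO", Sync.BARRIER, Buffer.ROLLOUT),
    "APPO": AlgoSpec("APPO", Sync.ASYNC, Buffer.ROLLOUT),
    "SAC": AlgoSpec("SAC", Sync.ASYNC, Buffer.REPLAY),
}


def schedule(spec):
    """Describe how a runtime could drive one algorithm from its declaration."""
    if spec.sync is Sync.BARRIER:
        collect = "pause workers at rollout boundary"
    else:
        collect = "workers run continuously"
    if spec.buffer is Buffer.ROLLOUT:
        consume = "drain whole trajectories"
    else:
        consume = "sample minibatches from replay"
    return f"{spec.name}: {collect}; {consume}"


if __name__ == "__main__":
    for spec in SPECS.values():
        print(schedule(spec))
```

Keeping the declaration separate from the scheduler means a new algorithm only has to state its needs; the collection and consumption logic stays in one place.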

Shared-memory IPC

A single shared buffer holds rollouts and replay slices; weight sync is a lock-free publication channel. There is no serialization and no per-step round trip: workers and learner only meet through memory.
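The page does not say how the publication channel is built. One common pattern that fits the description is a seqlock-style versioned slot, sketched below under that assumption. The class `WeightChannel` and its layout (an 8-byte version counter followed by float32 weights) are hypothetical, and as with any pure-Python shared-memory code there are no memory-ordering guarantees, so this is illustrative only.

```python
import struct
from multiprocessing import shared_memory


class WeightChannel:
    """Hypothetical versioned weight slot: one writer (learner), many readers (workers)."""

    def __init__(self, n_params, name=None):
        self.payload = struct.Struct(f"<{n_params}f")
        self.owner = name is None
        self.shm = shared_memory.SharedMemory(
            name=name, create=self.owner, size=8 + self.payload.size)
        if self.owner:
            struct.pack_into("<Q", self.shm.buf, 0, 0)
            self.payload.pack_into(self.shm.buf, 8, *([0.0] * n_params))

    def _version(self):
        return struct.unpack_from("<Q", self.shm.buf, 0)[0]

    def publish(self, weights):
        """Learner side: never waits for readers."""
        version = self._version()
        struct.pack_into("<Q", self.shm.buf, 0, version + 1)  # odd: write in progress
        self.payload.pack_into(self.shm.buf, 8, *weights)
        struct.pack_into("<Q", self.shm.buf, 0, version + 2)  # even: snapshot complete

    def read(self, max_tries=1000):
        """Worker side: retry until a snapshot was not modified while being copied."""
        for _ in range(max_tries):
            before = self._version()
            if before % 2:  # writer is mid-update
                continue
            weights = self.payload.unpack_from(self.shm.buf, 8)
            if self._version() == before:
                return before // 2, weights
        raise RuntimeError("no consistent snapshot within max_tries")

    def close(self):
        self.shm.close()
        if self.owner:
            self.shm.unlink()
```

Readers get back a publication index along with the weights, so a worker can skip the copy into its local policy when the index has not changed since its last read.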

Coverage

Backends, platforms, robots, algorithms, tasks

Physics backends: MotrixSim · MuJoCo
Platforms: Apple Silicon (MPS / MLX) · NVIDIA CUDA · AMD ROCm · Intel Arc / XPU
Robots: Quadrupeds (Go1 / Go2) · Humanoids (G1) · Dexterous hands (Allegro / Sharpa) · Wheeled-leg (Go2w)
Algorithms: PPO · APPO · SAC · TD3 · FlashSAC · HORA · HIM-PPO
Task families: Walking · Parkour · Dancing · Flipping · Skating · Loco-manipulation · Dexterous manipulation

Comparison

UniLab vs. GPU-centric stacks

UniLab is not another GPU-resident simulator. It is a heterogeneous runtime that lets the simulator and the learner each run on the hardware it is best suited to.

Framework | GPU-resident sim | Heterogeneous runtime | PPO / SAC / TD3 | Hardware / backend support
IsaacLab
IsaacGym
mjlab
Genesis
IsaacSim
UniLab

● supported   ◐ partial   ○ not supported. Platform icons denote supported hardware/backend targets; an amber outline indicates evaluation-only or limited support.

Results

End-to-end efficiency, on matched hardware

End-to-end training efficiency. The headline result on representative robot-control tasks, comparing UniLab against GPU-centric baselines on matched hardware: a roughly 3–10× wall-clock speedup on the same single-GPU / single-CPU workstation (3.3× on G1 Flip, 8.4× on G1 Walk Flat, 11.0× on G1 Motion Tracking).
Can CPU simulation keep up? Across representative robot-control scenes, batched CPU rigid-body simulation (MuJoCoUni / MotrixSim) sustains the simulator-side throughput the heterogeneous split needs; CPU physics is not the dominant bottleneck for these workloads.

From sim to real

Policies that leave the simulator

The same heterogeneous training stack produces policies that transfer to hardware. Below, a short video walks through UniLab's sim-to-real experiments across six real-robot tasks; right after it, you can drive the trained policies yourself in the browser.

Sim-to-real demonstrations. Six real-robot tasks trained in UniLab and deployed to hardware: a deployment-side video complementing the end-to-end simulation results.

Task Gallery

Try the policies in your browser

Each card opens a MotrixSim demo of a UniLab-trained policy, with compact notes on training time and behavior.

Locomotion: Quadruped

Go1: Joystick (flat)

Quadruped · Go1 · PPO (torch)

Go2: Joystick (flat)

Quadruped · Go2 · PPO (torch)

Go2: Handstand

Quadruped · Go2 · PPO (torch)

Go2: Joystick (rough)

Quadruped · Go2 · PPO (torch)

Locomotion: Wheeled-Leg

Go2w: Joystick (rough tiles)

Wheeled-leg · Go2w · PPO (torch)

Humanoid: Walking

G1: Walk (flat) · SAC

Humanoid · G1 · SAC (torch)

G1: Walk (rough)

Humanoid · G1 · SAC (torch)

Humanoid: Whole-Body Skills

G1: Dance (SAC WBT)

Humanoid · G1 · SAC (torch)

G1: Dance (motion tracking)

Humanoid · G1 · PPO (torch)

G1: Shuttle run (motion tracking)

Humanoid · G1 · PPO (torch)

G1: Box tracking

Humanoid · G1 · Object: large box

G1: Backflip (flip tracking)

Humanoid · G1 · PPO (torch)

G1: Wall back-flip (flip tracking)

Humanoid · G1 · PPO (torch)

G1: Climb (motion tracking)

Humanoid · G1 · PPO (torch)

Manipulation

Allegro: In-hand reorient

Dexterous hand · Allegro · PPO / APPO

Sharpa: In-hand reorient

Dexterous hand · Sharpa · PPO / APPO

Loco-manipulation

Quadruped · Go2+airbot · PPO

Cross-Platform

One framework, four platforms, including macOS

Almost every robot-RL stack today ships Linux/CUDA-first; Apple Silicon is an afterthought, if it works at all. UniLab is the opposite: macOS is a first-class target, running the same code on MPS and MLX as on CUDA, ROCm, and Intel XPU.

Platform | Hardware | Backend | Representative task | Trained
macOS · MPS / MLX | M3 / M5 Max, MacBookNeo | MotrixSim / MuJoCo | Go2 joystick (PPO), G1 walk-flat (FastSAC) |
Linux · CUDA | RTX 4090 + R9-9950X3D | MotrixSim / MuJoCo | G1 walk-flat, Go2 joystick, Allegro in-hand |
Linux · ROCm | AMD GPU | MotrixSim / MuJoCo | G1 walk-flat (ROCm AMP) |
Linux · XPU | Intel Arc | MotrixSim / MuJoCo | Go2 joystick |

See the paper appendix for full per-platform setup, seeds, and wall-clock numbers.

Trainable across platforms, not just benchmarked. Training curves and final performance on representative devices, including Apple Silicon (M5 Max), AMD, and Intel, showing practical trainability outside a single CUDA-centric setup.
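Platforms (column order): RTX 4090 + AMD 9950X3D · AMD 8060S + AMD AI MAX 395 · AMD W7900 + AMD 7900X · AMD MI300X + Intel 8568Y · M5 MAX · M4 · A18 Pro · XPU 185H
Values are listed in column order; rows with fewer than eight values omit the unmeasured platforms.
FastSAC G1 WBT: 18.5 · 33.6 · 29.3 · 75.0 · 280.7 · 446.1
FastSAC G1 Walk Flat: 3.0 · 9.4 · 4.7 · 6.9 · 18.8 · 62.4 · 115.4
FlashSAC G1 Walk Flat: 5.3 · 20.5 · 6.8 · 9.1 · 27.0 · 78.6 · 114.9
FlashSAC Go2 Joystick Flat: 1.1 · 4.2 · 1.8 · 2.8 · 4.5 · 14.9 · 32.2 · 20.0
PPO G1 Flip Tracking: 16.4 · 19.6 · 15.2 · 12.5 · 16.8 · 31.2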

Wall-clock training time (minutes) per (task, platform) pair; — marks runs not measured.
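To read the table as relative cost, it can help to normalize one row against a reference platform. The sketch below does this for FlashSAC Go2 Joystick Flat, the only row with a value for every platform. It assumes the row's eight values map onto the eight platform columns in order; the helper name `relative_to` is illustrative.

```python
# Minutes for "FlashSAC Go2 Joystick Flat", keyed by platform column.
GO2_MINUTES = {
    "RTX 4090 + AMD 9950X3D": 1.1,
    "AMD 8060S + AMD AI MAX 395": 4.2,
    "AMD W7900 + AMD 7900X": 1.8,
    "AMD MI300X + Intel 8568Y": 2.8,
    "M5 MAX": 4.5,
    "M4": 14.9,
    "A18 Pro": 32.2,
    "XPU 185H": 20.0,
}


def relative_to(reference, minutes):
    """Express each platform's training time as a multiple of the reference platform's."""
    base = minutes[reference]
    return {platform: round(t / base, 1) for platform, t in minutes.items()}


if __name__ == "__main__":
    ratios = relative_to("RTX 4090 + AMD 9950X3D", GO2_MINUTES)
    for platform, ratio in ratios.items():
        print(f"{platform}: {ratio}x")
```

On this task the M5 MAX run takes about 4.1 times as long as the RTX 4090 workstation, which is a few minutes in absolute terms.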

Cite

BibTeX

UniLab

Primary paper

arXiv
@article{jia2026unilab,
  title         = {UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms},
  author        = {Yufei Jia and Zhanxiang Cao and Mingrui Yu and Heng Zhang and Shenyu Chen and Dixuan Jiang and Meng Li and Xiaofan Li and Yiyang Liu and Junzhe Wu and Zheng Li and XiLin Fang and Ting-Yu Tsui and Shengcheng Fu and Haoyang Li and Anqi Wang and Zifan Wang and Dongjie Zhu and Chenyu Cao and Zhenbiao Huang and Ziang Zheng and Jie Lu and Xin Ma and Zhengyang Wei and Xiang Zhao and Tianyue Zhan and Ye He and Yuxiang Chen and Yizhou Jiang and Yue Li and Haizhou Ge and Yuhang Dong and Fan Jia and Ziheng Zhang and Meng Zhang and Xiwa Deng and Zhixing Chen and Hanyang Shao and Chenxin Dong and Yixuan Li and Yizhi Chen and Bokui Chen and Kaifeng Zhang and Hanqing Cui and Yusen Qin and Ruqi Huang and Lei Han and Tiancai Wang and Xiang Li and Yue Gao and Guyue Zhou},
  journal       = {arXiv preprint arXiv:2605.30313},
  year          = {2026},
  url           = {https://arxiv.org/abs/2605.30313}
}

Physics Backends

Runtime dependencies

2 refs
@article{jia2026mujocouni,
  title={MuJoCoUni: Persistent Batched Runtime Primitives for MuJoCo},
  author={Jia, Yufei and Wu, Junzhe},
  journal={arXiv preprint arXiv:2605.24922},
  year={2026}
}

@software{motrixsim2026,
  title  = {MotrixSim: A Physics Simulation Engine for Robotics and Embodied AI},
  author = {{Motphys Team}},
  year   = {2026},
  url    = {https://motrixsim.readthedocs.io/},
  note   = {Python binary package}
}

Acknowledgments

Joint Contributors

Yufei Jia*, Zhanxiang Cao*, Mingrui Yu*, Heng Zhang*, Shenyu Chen*, Dixuan Jiang*,
Meng Li, Xiaofan Li, Yiyang Liu, Junzhe Wu, Zheng Li, XiLin Fang, Ting-Yu Tsui,
Shengcheng Fu, Haoyang Li, Anqi Wang, Zifan Wang, Dongjie Zhu, Chenyu Cao,
Zhenbiao Huang, Ziang Zheng, Jie Lu, Xin Ma, Zhengyang Wei, Xiang Zhao, Tianyue Zhan,
Ye He, Yuxiang Chen, Yizhou Jiang, Yue Li, Haizhou Ge, Yuhang Dong, Fan Jia,
Ziheng Zhang, Meng Zhang, Xiwa Deng, Zhixing Chen, Hanyang Shao, Chenxin Dong, Yixuan Li,
Yizhi Chen, Bokui Chen, Kaifeng Zhang, Hanqing Cui, Yusen Qin, Ruqi Huang,
Lei Han, Tiancai Wang, Xiang Li, Yue Gao, Guyue Zhou

* Core contributors. Advising. Correspondence to: Yufei Jia <jyf23@mails.tsinghua.edu.cn>.

Tsinghua University · Institute for AI Industry Research, Tsinghua University · Shanghai Jiao Tong University · Shanghai Innovation Institute · Harbin Institute of Technology, Shenzhen · Beijing Institute of Technology
Northeastern University · Southern University of Science and Technology · Tongji University · The Hong Kong University of Science and Technology (Guangzhou) · National University of Singapore
Hubei University of Technology · Wuhan Textile University · Nanjing University · Zhejiang University
AMD · Sharpa · D-Robotics · Galbot
Motphys · DISCOVER Robotics · Dexmal