Embodied Reinforcement Learning

UniLab: A Heterogeneous Architecture for Robot RL
Beyond GPU-Dominant Paradigms

UniLab decouples CPU simulation from GPU learning and connects the two through a shared-memory runtime, so that simulators and learners stop waiting on each other. The result is an end-to-end pipeline that runs unchanged on Apple Silicon (MPS / MLX), a first-class target where prior training stacks force a Linux/CUDA detour, as well as on NVIDIA, AMD, and Intel hardware. It reaches policy targets 3–10× faster than tightly coupled GPU-centric baselines.


End-to-end speedup
3–10×

Physics backends
MotrixSim · MuJoCo

Platforms
macOS · CUDA · ROCm · Intel Arc

Robot categories
arms · quadrupeds · humanoids · hands · wheeled-leg

Algorithms
PPO · APPO · SAC · TD3 · FlashSAC · HORA · HIM-PPO

Task families
walk · parkour · dance · flip · skate · loco-manip · dex-manip
14 tasks shipped

Contributions

What UniLab brings to embodied RL

CPU sim, GPU learn: no waiting

Independent CPU rollout workers stream transitions through a lock-free shared-memory buffer to a GPU learner. Asynchronous weight sync removes the simulator–learner stall that dominates tightly coupled pipelines.
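The worker-to-learner path described above can be pictured as a single-producer / single-consumer ring buffer placed in shared memory. The sketch below is only an illustration under stated assumptions: the class name `TransitionRing`, its layout, and its methods are hypothetical and are not taken from UniLab's code. Python also gives no cross-process memory-ordering guarantees, so a production version would implement the counters with atomics in native code.

```python
import numpy as np
from multiprocessing import shared_memory


class TransitionRing:
    """Hypothetical single-producer / single-consumer ring over shared memory.

    Layout: two int64 counters (head, tail) followed by `capacity` rows of
    `width` float32 values. Only the producer writes `head` and only the
    consumer writes `tail`, so neither side takes a lock.
    """

    HEADER_BYTES = 16  # two int64 counters

    def __init__(self, capacity, width, name=None):
        self.capacity, self.width = capacity, width
        self.owner = name is None
        if self.owner:
            size = self.HEADER_BYTES + capacity * width * 4
            self.shm = shared_memory.SharedMemory(create=True, size=size)
        else:
            self.shm = shared_memory.SharedMemory(name=name)  # attach only
        self.counters = np.ndarray((2,), dtype=np.int64, buffer=self.shm.buf)
        self.rows = np.ndarray((capacity, width), dtype=np.float32,
                               buffer=self.shm.buf, offset=self.HEADER_BYTES)
        if self.owner:
            self.counters[:] = 0

    def push(self, row):
        """Producer side: returns False when the ring is full."""
        head, tail = int(self.counters[0]), int(self.counters[1])
        if head - tail >= self.capacity:
            return False
        self.rows[head % self.capacity] = row
        self.counters[0] = head + 1  # publish the slot after it is written
        return True

    def pop(self):
        """Consumer side: returns a copy of the oldest row, or None if empty."""
        head, tail = int(self.counters[0]), int(self.counters[1])
        if tail == head:
            return None
        row = self.rows[tail % self.capacity].copy()
        self.counters[1] = tail + 1  # free the slot after it is copied
        return row

    def close(self):
        # Drop the NumPy views first, otherwise SharedMemory.close() fails.
        self.counters = self.rows = None
        self.shm.close()
        if self.owner:
            self.shm.unlink()
```

In this sketch a rollout worker would call `push(transition)` and the learner would call `pop()`; a second process attaches to the same block with `TransitionRing(capacity, width, name=ring.shm.name)`.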

macOS as a first-class target

Train end-to-end on Apple Silicon (MPS / MLX), with no Linux box required. The same stack also runs on NVIDIA (CUDA), AMD (ROCm), and Intel (Arc / XPU). Match your hardware, not the framework's assumptions.
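For the PyTorch side of such a stack, matching the hardware usually comes down to picking a device string at startup. The helper below is a minimal sketch, not UniLab's own selection logic: `pick_device` is a hypothetical name, the preference order is an assumption, and MLX is not covered. It relies only on standard PyTorch availability checks and falls back to the CPU when PyTorch or an accelerator is missing.

```python
def pick_device():
    """Return a PyTorch device string for the first available accelerator.

    Hypothetical helper; the order cuda > mps > xpu > cpu is an assumption.
    """
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment

    # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API.
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon GPUs via Metal Performance Shaders.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    # Intel GPUs; torch.xpu exists only in recent PyTorch releases.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    return "cpu"
```

The returned string can be passed to `torch.device(...)`, so the same training script can run on each platform listed above without per-platform branches in the learner code.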

Broad algorithm coverage

On-policy (PPO, APPO, HIM-PPO), off-policy (SAC, TD3, FlashSAC), and distillation (HORA) algorithms all run on the same heterogeneous runtime, including an asynchronous PPO adapted for decoupled execution.

**UniLab system architecture.** A unified RL workflow assembly feeds CPU-side batched physics backends (MuJoCoUni / MotrixSim); a heterogeneous simulation-and-learning core (CPU workers, host IPC, GPU device learner) coordinates data, scheduling, and parameter sync; and the trained policies drive real-robot applications. One pipeline addresses all three points: CPU workers feed a host shared-memory buffer that a GPU learner consumes (decoupled CPU-sim / GPU-learn), the same runtime hosts PPO · APPO · FastSAC · FlashSAC, and the whole stack runs across CUDA, Apple, AMD, and Intel.

Architecture

How the pipeline stays busy

UniLab is built around two ideas: spatial decoupling of simulation and learning across heterogeneous devices, and temporal overlap of trajectory collection, gradient steps, and weight sync.

**Collection–update timing and overlap.** A one-cycle pipeline timeline: CPU collection and GPU learning overlap in wall-clock time, with parameter synchronization near rollout boundaries, so the learner is not stalled behind rollout generation.
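The effect of this overlap on a single cycle can be estimated with a deliberately idealized model. It assumes that collection and learning either run back to back (tightly coupled) or fully in parallel (decoupled), and that parameter sync is a short step that is not overlapped. The function names and the timings used to try it out are illustrative assumptions, not measurements from UniLab.

```python
def serial_cycle(t_collect, t_learn, t_sync):
    """Cycle time when collection, learning, and sync run one after another."""
    return t_collect + t_learn + t_sync


def overlapped_cycle(t_collect, t_learn, t_sync):
    """Cycle time when collection and learning run concurrently.

    The slower of the two stages sets the pace; sync is assumed to be
    a short, non-overlapped step at the rollout boundary.
    """
    return max(t_collect, t_learn) + t_sync


def overlap_gain(t_collect, t_learn, t_sync):
    """Ratio of the serial cycle time to the overlapped cycle time."""
    return serial_cycle(t_collect, t_learn, t_sync) / overlapped_cycle(
        t_collect, t_learn, t_sync
    )
```

Under this model the stage with the longer duration is the one worth optimizing: shortening the faster stage leaves the overlapped cycle time unchanged.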

Algorithm runtimes

PPO and APPO use trajectory-oriented runners. Each algorithm declares its sync requirement and buffer usage; the runtime schedules accordingly.
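One way to picture "declare, then schedule" is a small spec object per algorithm that the runtime inspects. Everything in the sketch below is hypothetical: the names `AlgoSpec`, `Sync`, `Buffer`, and `schedule`, and the specific sync modes assigned to each algorithm, are assumptions made for illustration and do not describe UniLab's real interface. The only detail taken from the text is that PPO and APPO use trajectory-oriented runners.

```python
from dataclasses import dataclass
from enum import Enum


class Sync(Enum):
    BOUNDARY = "boundary"  # weights exchanged at rollout boundaries
    ASYNC = "async"        # workers pick up new weights whenever published


class Buffer(Enum):
    TRAJECTORY = "trajectory"  # whole rollouts, consumed once
    REPLAY = "replay"          # transitions, sampled repeatedly


@dataclass(frozen=True)
class AlgoSpec:
    """What an algorithm tells the runtime about itself (hypothetical)."""
    name: str
    sync: Sync
    buffer: Buffer


# Assumed declarations, for illustration only.
SPECS = {
    "PPO": AlgoSpec("PPO", Sync.BOUNDARY, Buffer.TRAJECTORY),
    "APPO": AlgoSpec("APPO", Sync.ASYNC, Buffer.TRAJECTORY),
    "SAC": AlgoSpec("SAC", Sync.ASYNC, Buffer.REPLAY),
}


def schedule(spec):
    """Derive scheduling decisions from a declaration."""
    return {
        "runner": "trajectory" if spec.buffer is Buffer.TRAJECTORY else "replay",
        "workers_wait_for_weights": spec.sync is Sync.BOUNDARY,
        "reuse_samples": spec.buffer is Buffer.REPLAY,
    }
```

The design point is that the runtime never branches on the algorithm name; adding an algorithm means adding a declaration, not a new scheduling path.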

Shared-memory IPC

A single shared buffer holds rollouts and replay slices; weight sync is a lock-free publication channel. There is no serialization and no per-step round-trip: workers and learner only meet through memory.
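A lock-free publication channel for weights is often built as a versioned snapshot (a seqlock): the writer bumps a counter to an odd value, writes, then bumps it to an even value, and readers retry if the counter changed underneath them. The sketch below illustrates that general technique and is not UniLab's implementation. `WeightChannel` and its methods are hypothetical names, and because Python offers no memory-ordering guarantees across processes, a real version would implement the counter with atomics and fences in native code.

```python
import numpy as np
from multiprocessing import shared_memory


class WeightChannel:
    """Hypothetical single-writer, many-reader weight snapshot in shared memory.

    Layout: one int64 version counter followed by `n_params` float32 values.
    An odd version means a write is in progress.
    """

    def __init__(self, n_params, name=None):
        self.owner = name is None
        if self.owner:
            self.shm = shared_memory.SharedMemory(create=True, size=8 + 4 * n_params)
        else:
            self.shm = shared_memory.SharedMemory(name=name)  # attach only
        self.version = np.ndarray((1,), dtype=np.int64, buffer=self.shm.buf)
        self.weights = np.ndarray((n_params,), dtype=np.float32,
                                  buffer=self.shm.buf, offset=8)
        if self.owner:
            self.version[0] = 0
            self.weights[:] = 0.0

    def publish(self, new_weights):
        """Learner side: write a new snapshot without taking a lock."""
        self.version[0] += 1          # odd: write in progress
        self.weights[:] = new_weights
        self.version[0] += 1          # even: snapshot complete

    def read(self, max_retries=100):
        """Worker side: return (snapshot_id, weights copy), or None on failure."""
        for _ in range(max_retries):
            before = int(self.version[0])
            if before % 2 == 1:
                continue              # writer is mid-update, try again
            snapshot = self.weights.copy()
            if int(self.version[0]) == before:
                return before // 2, snapshot
        return None

    def close(self):
        # Drop the NumPy views first, otherwise SharedMemory.close() fails.
        self.version = self.weights = None
        self.shm.close()
        if self.owner:
            self.shm.unlink()
```

Readers never block the writer in this scheme; the cost is that a reader may copy the weights more than once when it races with an update.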

Coverage

Backends, platforms, robots, algorithms, tasks

Physics backends

MotrixSim · MuJoCoUni

Platforms

Apple Silicon (MPS / MLX)

NVIDIA CUDA

AMD ROCm

Intel Arc / XPU

Robots

Quadrupeds (Go1 / Go2) · Humanoids (G1) · Dexterous hands (Allegro / Sharpa) · Wheeled-leg (Go2w)

Algorithms

PPO · APPO · SAC · TD3 · FlashSAC · HORA · HIM-PPO

Task families

Walking · Parkour · Dancing · Flipping · Skating · Loco-manipulation · Dexterous manipulation

Comparison

UniLab vs. GPU-centric stacks

UniLab is not another GPU-resident simulator. It is a heterogeneous runtime that lets the simulator and the learner each run on the hardware they are best at.

Framework	GPU-resident sim	Heterogeneous runtime	PPO / SAC / TD3
IsaacLab	●	○	◐
IsaacGym	●	○	◐
mjlab	●	○	◐
Genesis	●	○	◐
IsaacSim	●	○	◐
UniLab	○	●	●

● supported · ◐ partial · ○ not supported. Platform icons denote supported hardware/backend targets; an amber outline indicates evaluation-only or limited support.

Results

End-to-end efficiency, on matched hardware

**End-to-end training efficiency.** UniLab is compared against GPU-centric baselines on representative robot-control tasks on matched hardware. The headline result is a **3–10× wall-clock speedup** on the same single-GPU / single-CPU workstation (3.3× on G1 Flip, 8.4× on G1 Walk Flat, 11.0× on G1 Motion Tracking).

**Can CPU simulation keep up?** CPU simulation throughput is measured across representative robot-control scenes for the MuJoCoUni and MotrixSim backends. Batched CPU rigid-body simulation sustains the simulator-side throughput the heterogeneous split needs; CPU physics is not the dominant bottleneck for these workloads.

From sim to real

Policies that leave the simulator

The same heterogeneous training stack produces policies that transfer to hardware. Below, a short video walks through UniLab's sim-to-real experiments across six real-robot tasks; right after it, you can drive the trained policies yourself in the browser.

**Sim-to-real demonstrations.** Six real-robot tasks trained in UniLab and deployed to hardware: a deployment-side video that complements the end-to-end simulation results.

Task Gallery

Try the policies in your browser

Each card opens a MotrixSim demo for a UniLab-trained policy, with compact notes on training time and behavior.

Locomotion: Quadruped

Go1 — Joystick (flat)

Quadruped · Go1 · PPO (torch)

Go2 — Joystick (flat)

Quadruped · Go2 · PPO (torch)

Go2 — Handstand

Quadruped · Go2 · PPO (torch)

Go2 — Joystick (rough)

Quadruped · Go2 · PPO (torch)

Locomotion: Wheeled-Leg

Go2w — Joystick rough tiles

Wheeled-leg · Go2w · PPO (torch)

Humanoid: Walking

G1 — Walk (flat) · SAC

Humanoid · G1 · SAC (torch)

G1 — Walk (rough)

Humanoid · G1 · SAC (torch)

Humanoid: Whole-Body Skills