UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms

Yufei Jia¹*, Zhanxiang Cao²,³*, Mingrui Yu¹*, Heng Zhang⁴*, Shenyu Chen⁵*, Dixuan Jiang⁶*, Meng Li⁷, Xiaofan Li⁷, Yiyang Liu¹, Junzhe Wu¹, Zheng Li¹¹, XiLin Fang⁸, Ting-Yu Tsui¹, Shengcheng Fu⁹,³, Haoyang Li²,³, Anqi Wang¹⁰, Zifan Wang¹¹, Dongjie Zhu¹, Chenyu Cao¹², Zhenbiao Huang¹³, Ziang Zheng¹, Jie Lu¹⁴, Xin Ma¹⁵, Zhengyang Wei¹⁵, Xiang Zhao⁴, Tianyue Zhan²,³, Ye He¹⁶, Yuxiang Chen¹⁷, Yizhou Jiang¹, Yue Li¹⁰, Haizhou Ge¹, Yuhang Dong¹⁸, Fan Jia¹⁹, Ziheng Zhang¹⁹, Meng Zhang¹⁹, Xiwa Deng⁴, Zhixing Chen¹, Hanyang Shao¹⁰, Chenxin Dong¹⁹, Yixuan Li⁶, Yizhi Chen⁹,³, Bokui Chen¹, Kaifeng Zhang²⁰, Hanqing Cui⁴, Yusen Qin²¹, Ruqi Huang¹, Lei Han¹⁰†, Tiancai Wang¹⁹†, Xiang Li¹†, Yue Gao²,³†, Guyue Zhou¹†

¹THU, ²SJTU, ³SII, ⁴Motphys, ⁵HITSZ, ⁶BIT, ⁷NEU, ⁸SUSTech, ⁹TJU, ¹⁰DISCOVER Robotics, ¹¹HKUST(GZ), ¹²Galbot, ¹³NUS, ¹⁴WTU, ¹⁵HBUT, ¹⁶AMD, ¹⁷NJU, ¹⁸ZJU, ¹⁹Dexmal, ²⁰Sharpa, ²¹D-Robotics

* Core contributors. † Advising. Correspondence: Yufei Jia <jyf23@mails.tsinghua.edu.cn>

Keywords: Robot Reinforcement Learning, Systems, Heterogeneous Training

**Figure 1:** Representative robot-control tasks in UniLab; "Uni" means unified cross-platform training. Teaser image rendered with MotrixSim.

#Abstract

Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop.

We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, FastSAC, FlashSAC, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends.

These results show that GPU simulation is an effective path to efficient training, but not a necessary one, broadening the practical system choices available for robot RL training. Project page: https://unilabsim.github.io.

#1. Introduction

Training infrastructure has become a first-order factor in simulation-based robot RL: faster training reduces the wall-clock cost of a single experiment, shortens system and algorithm iteration cycles, and expands the range of tasks that can be studied under practical hardware budgets. The dominant answer in recent years has been clear: place physics simulation, rollout collection, and learning on a GPU-centric execution path; Isaac Gym, Isaac Lab, MuJoCo Playground, mjlab, ManiSkill3, and Genesis show that large-scale GPU-resident environment parallelism can greatly accelerate robot control training. This success has shaped the current community default that efficient training should be organized around GPU-resident physics, tying high-throughput experimentation to a narrower set of GPU-resident software environments.

Robot RL training, however, is a closed-loop system coupling data generation, policy updates, and synchronization constraints, not a simulator benchmark alone. In simulation-dominated tasks, end-to-end efficiency depends on simulation throughput, learner utilization, collector–learner synchronization, data movement and buffering overhead, and whether hardware is allocated to the stage that actually limits wall-clock time. The learner may wait for rollouts, collectors may wait for new parameters, and data movement or buffering may erase parallel gains. Whether physics runs on the GPU is therefore one design choice within a broader systems organization problem.

High-throughput environment execution is also possible outside GPU-resident physics. General RL systems have long used CPU-side vectorized or batched environments, and robot RL has precedents for CPU-distributed or CPU-parallel simulation, including OpenAI's Rubik's-cube hand system and recent RaiSim-based locomotion work. Algorithmic data dependencies further shape this organization: PPO preserves the strongest rollout/update synchronization constraint; APPO allows collection and learning to overlap while remaining close to the on-policy setting; and off-policy methods such as FastSAC and FlashSAC further relax the dependence of each update on trajectories from the latest policy. This ordering lets us study algorithms as synchronization regimes: PPO tests whether CPU simulation can sustain strictly synchronized training, APPO tests collector–learner overlap once synchronization is relaxed, and FastSAC/FlashSAC test the replay-based producer–consumer path. This motivates the systems question studied here: can CPU-side batched rigid-body simulation, GPU-side policy learning, and the runtime path between them form an efficient end-to-end training loop?

This paper asks whether efficient simulation-based robot control training must rely on GPU-resident simulation. Our thesis is that simulation-dominated robot control training requires high-throughput, well-coordinated simulation-learning execution, rather than GPU-resident simulation itself. We focus on representative robot control tasks in simulation, leaving real-world RL and vision-dominated settings outside the scope of this paper.

We present UniLab, a heterogeneous CPU-simulation / GPU-learning training architecture. CPU-side MuJoCoUni and MotrixSim backends perform batched rigid-body simulation and data generation, GPU resources perform policy and value learning, and a unified runtime coordinates data movement, buffering, and synchronization. UniLab is a training-system organization rather than a new policy optimization algorithm; it is implemented as a complete and extensible training system with unified training and evaluation entrypoints and explicit task/backend interfaces, while supporting PPO, FastSAC, FlashSAC, and APPO in one framework.

Across representative simulated robot-control benchmarks, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ on the same single-GPU/single-CPU workstation, while reducing dependence on the NVIDIA CUDA-based software stack and supporting execution on Apple macOS, AMD ROCm, and Intel XPU backends. Our contributions are threefold:

Systems framing. We recast efficient robot RL training as a systems organization problem for the simulation-learning closed loop, rather than a consequence of GPU-resident physics alone.
Heterogeneous training architecture. We present UniLab, which connects CPU-batched physics backends, a GPU learner, data buffering, and parameter synchronization through a unified runtime, while supporting PPO, FastSAC, FlashSAC, and APPO in one framework.
End-to-end evidence. We show $3\text{--}10\times$ wall-clock gains across robot embodiments, control workloads, and practical algorithms, together with execution evidence on macOS, ROCm, and XPU backends.

#2. Related Work

#2.1 GPU-Resident Robot Learning

**Table 1:** Representative robot RL training systems.
System	Physics	Batch	Coupling
`IsaacGym`	PhysX	GPU-C	GPU-sync
`IsaacLab`	PhysX	GPU-C	GPU-sync
`Genesis`	Taichi	GPU-C/M/R	GPU-sync
`MJP`	MJX	GPU-C	GPU-sync
`MjLab`	MJWarp	GPU-C	GPU-sync
UniLab	MJU/Mtx	CPU	H-async/sync

Note. GPU-C/M/R: GPU batched physics on CUDA/Metal/ROCm. GPU-sync: synchronized GPU simulation–learning; H-async/sync: CPU simulation with GPU learning. MJU/Mtx/MJP: MuJoCoUni/MotrixSim/MuJoCo Playground.

The dominant systems path for efficient robot RL training has been to place physics simulation, rollout collection, and learning on a GPU-centric execution path. MuJoCo provides a widely used foundation for robot control simulation, while Isaac Gym, Isaac Lab, MuJoCo Playground, mjlab, ManiSkill3, and Genesis have made large-scale GPU-resident environment parallelism a standard practice for robot learning.

#2.2 Systems Lesson from GPU Simulation

The central lesson from GPU-resident systems is the integration of fast physics execution with tightly coupled rollout collection and learner updates. For on-policy methods such as PPO, this organization fits synchronized batched rollout/update cycles and has proven effective across robot-control workloads. We adopt this systems lesson but separate the training-system principle from one hardware path. Efficient training requires low-overhead data generation, learning, and synchronization. GPU kernels, in turn, are most effective for regular, dense, and statically shaped execution, and that execution model is stressed by dynamic active contact sets, sparse interactions, collision handling, contact solving, closed-chain or other constraint handling, and contact-rich manipulation.

#2.3 CPU-Parallel Environment Execution

High-throughput environment execution also has a history outside GPU-resident physics. In general RL, EnvPool, RLlib, Tianshou, and PufferLib use CPU-side vectorized, batched, or parallel rollout collection as core system components. Robot RL also has CPU-distributed or CPU-parallel precedents, including OpenAI's Rubik's-cube hand system and recent RaiSim-based locomotion work. These examples show that CPU-side environment parallelism is viable; UniLab asks whether, under the same hardware setting, modern CPU-batched simulation and a GPU learner can form an efficient end-to-end training path through a low-overhead runtime rather than only at extreme worker-cluster scale.

#2.4 Replay-Based Robot-Control Acceleration

Algorithmic data dependencies further shape the system organization. PPO is the practical default in many large-scale robot-training workloads, but its on-policy updates preserve strong synchronization between rollout generation and learner updates. Replay-based methods such as SAC and TD3 can reuse past experience and relax this dependence, while FastTD3, FastSAC, and FlashSAC show that this direction can accelerate high-dimensional robot control. UniLab studies the complementary systems question: when data dependencies are relaxed, how can CPU simulation and GPU learning be coordinated to improve end-to-end wall-clock efficiency?

#3. UniLab Architecture

This section describes UniLab as an end-to-end training loop that combines CPU-side batched rigid-body simulation, GPU-side policy and value learning, and a unified runtime for coordinating the data path between them.

**Figure 2:** UniLab system architecture. The figure shows the data, scheduling, and parameter-synchronization paths between CPU-side batched physics backends, the unified runtime, and the GPU learner.

#3.1 Design Objective and Requirements

The design objective is to improve the efficiency of the full simulation-learning loop without requiring GPU-resident simulation. UniLab assigns work by hardware role: CPUs generate large-scale simulation data, GPUs perform dense learning updates, and the runtime minimizes coordination cost. This objective induces three requirements:

CPU-side simulation throughput. CPU-side batched rigid-body simulation must sustain enough throughput to continuously generate data for the workloads studied here.

Non-blocking GPU learning. The GPU learner should consume buffered experience rather than idling behind rollout generation.

Controlled runtime overhead. Data movement, buffering, and parameter synchronization must remain low-overhead so that the heterogeneous split does not degenerate into blocking handoffs.
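One way to see how the three requirements interact is an idealized per-cycle timing model. The symbols below are introduced here for illustration and are not notation from the paper: $t_{\mathrm{sim}}$ is the CPU time to collect one batch, $t_{\mathrm{learn}}$ is the GPU time for the corresponding update, and $t_{\mathrm{rt}}$ is runtime overhead (data movement, buffering, synchronization) that is not hidden by overlap.

$$
T_{\mathrm{sync}} = t_{\mathrm{sim}} + t_{\mathrm{learn}} + t_{\mathrm{rt}},
\qquad
T_{\mathrm{overlap}} \approx \max\left(t_{\mathrm{sim}},\, t_{\mathrm{learn}}\right) + t_{\mathrm{rt}}.
$$

Under this sketch, the first requirement keeps $t_{\mathrm{sim}}$ small, the second replaces the sum with a maximum, and the third keeps $t_{\mathrm{rt}}$ from dominating either expression. The model ignores effects such as policy staleness and contention for shared resources.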

#3.2 Execution Architecture

The system organization consists of: CPU workers that generate trajectories or transitions, a GPU learner that performs policy and value updates, and a unified runtime that coordinates data movement, buffering, scheduling, and parameter synchronization.

Collection–update timing and overlap. UniLab supports both synchronized and loosely coupled collection–update timing. Standard PPO uses a synchronized rollout/update cycle. APPO follows an asynchronous on-policy formulation: the collector writes fixed-horizon rollouts into a shared ring buffer while continuing on the CPU; the learner drains available rollouts and performs V-trace correction and PPO-style updates on the GPU. CPU collection and GPU learning therefore overlap in wall-clock time. FastSAC and FlashSAC use replay-based timing: collectors insert transition batches into a shared replay buffer, while the learner performs multiple updates from device batches.
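The overlapped collection and learning described for APPO can be sketched with a bounded ring buffer shared by one collector thread and one learner thread. This is a minimal illustration using only the Python standard library. `RolloutRing`, `collector`, and `learner` are hypothetical names, the overwrite-oldest policy is an assumption, and the shared-memory layout, V-trace correction, and GPU updates of the real system are omitted.

```python
import threading
from collections import deque


class RolloutRing:
    """Bounded ring buffer (hypothetical): the oldest rollout is dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._slots = deque(maxlen=capacity)  # deque discards the oldest item when full
        self._cond = threading.Condition()
        self._closed = False

    def put(self, rollout) -> None:
        with self._cond:
            self._slots.append(rollout)
            self._cond.notify()

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def drain(self) -> list:
        """Wait for at least one rollout, then take everything that is available.

        Returns an empty list only after close() and once the buffer is empty.
        """
        with self._cond:
            while not self._slots and not self._closed:
                self._cond.wait()
            batch = list(self._slots)
            self._slots.clear()
            return batch


def collector(ring: RolloutRing, num_rollouts: int, horizon: int) -> None:
    for i in range(num_rollouts):
        # Stand-in for a fixed-horizon rollout of (obs, action, reward, ...) tuples.
        ring.put([(i, t) for t in range(horizon)])
    ring.close()


def learner(ring: RolloutRing, consumed: list) -> None:
    while True:
        batch = ring.drain()
        if not batch:
            break
        consumed.extend(batch)  # a real learner would run its update here


ring = RolloutRing(capacity=4)
consumed = []
threads = [
    threading.Thread(target=collector, args=(ring, 16, 8)),
    threading.Thread(target=learner, args=(ring, consumed)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"learner consumed {len(consumed)} of 16 rollouts")
```

Because the collector never blocks on the learner, the number of rollouts consumed depends on thread scheduling. The most recent rollout is always delivered, while older ones may be overwritten.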

Runtime abstraction. The unified runtime lets synchronized and loosely coupled execution share one system stack, connecting robot assets, task configurations, simulation backends, and learning algorithms through explicit interfaces.

#3.3 CPU Physics Backends and Task Interface

Batched CPU physics. UniLab realizes CPU-side throughput through backend-native batched environment execution: CPU workers advance environments at batch granularity and generate trajectories or transitions for the downstream learner.

Backend contract. The current system connects two practical CPU-side simulation backends under a shared runtime contract. MuJoCoUni provides a CPU-batched MuJoCo runtime backend; the MotrixSim backend maps the same task and runtime contract onto the MotrixSim physics and rendering stack.

Task and randomization interface. This contract covers task state, actions, observation-related data, reset and interval randomization hooks, terrain context, and playback capabilities. This design separates physics semantics from training throughput; the same learner binding can also target macOS, ROCm, and XPU.
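The idea of a shared backend contract can be illustrated with a small structural interface. The sketch below is not UniLab's API: `PhysicsBackend`, `ToyBackend`, and `collect` are hypothetical names, and the contract is reduced to `reset` and `step` so that the example stays self-contained.

```python
from typing import Protocol, Sequence


class PhysicsBackend(Protocol):
    """Hypothetical minimal contract that a batched CPU backend would satisfy."""

    num_envs: int

    def reset(self, env_ids: Sequence[int]) -> list[list[float]]: ...

    def step(self, actions: Sequence[Sequence[float]]) -> list[list[float]]: ...


class ToyBackend:
    """Stand-in backend: each environment is a 1-D point integrated by its action."""

    def __init__(self, num_envs: int, dt: float = 0.01) -> None:
        self.num_envs = num_envs
        self.dt = dt
        self._x = [0.0] * num_envs

    def reset(self, env_ids: Sequence[int]) -> list[list[float]]:
        for i in env_ids:
            self._x[i] = 0.0
        return [[x] for x in self._x]

    def step(self, actions: Sequence[Sequence[float]]) -> list[list[float]]:
        for i, action in enumerate(actions):
            self._x[i] += self.dt * action[0]
        return [[x] for x in self._x]


def collect(backend: PhysicsBackend, steps: int) -> list[list[float]]:
    """Backend-agnostic loop: it relies only on the contract above."""
    obs = backend.reset(range(backend.num_envs))
    for _ in range(steps):
        actions = [[1.0] for _ in range(backend.num_envs)]
        obs = backend.step(actions)
    return obs


final_obs = collect(ToyBackend(num_envs=3), steps=100)
print(final_obs)
```

Any class with matching `num_envs`, `reset`, and `step` members satisfies the protocol, so `collect` does not need to know which physics engine sits behind it.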

#4. Experiments

We evaluate three questions: whether CPU simulation provides enough throughput, whether heterogeneous CPU-simulation / GPU-learning improves end-to-end wall-clock efficiency, and whether the result is robust across task families and algorithms.

#4.1 Experimental Setup

Controlled comparisons use the same default Linux hardware: one NVIDIA RTX 4090 GPU, one AMD Ryzen 9 9950X3D CPU, and 64 GB of 4800 MT/s memory. The task set spans locomotion, motion tracking, manipulation, and manipulation-locomotion across quadruped, wheeled-quadruped, humanoid, and dexterous-hand embodiments. Algorithms are organized by synchronization constraints: PPO (strictly synchronized), APPO (near-on-policy with overlap), and FastSAC/FlashSAC (replay-based producer–consumer).

#4.2 Can CPU Simulation Provide Enough Throughput?

In common robot-RL training settings, CPU physics does not necessarily provide lower throughput than GPU-based simulation; its relative advantage is more pronounced in workloads with complex contact and dexterous manipulation. Batched CPU simulation provides the simulator-side capacity required by the heterogeneous execution model.

**Figure 4:** CPU simulation throughput across representative robot control scenes. The figure establishes the simulator-side capacity that underlies the end-to-end training results.

**Table 2:** CPU env-step throughput ($10^3$ steps/s) by task and chip.
Chip	Go2 (MJ)	Go2 (Motrix)	G1 (MJ)	G1 (Motrix)	Hand (MJ)	Hand (Motrix)
A18 Pro	55.7	122.9	28.4	18.1	183.9	134.1
M5 Max	288.0	797.8	178.8	127.7	1118.4	982.9
R9-8945HX	246.2	704.2	154.6	113.6	434.1	542.2
TR-9980X	915.9	2662.7	517.9	410.4	1991.5	2622.6
i7-11800H	82.1	162.0	34.7	23.8	176.8	151.6
Xeon 8558	1002.4	847.2	424.6	379.5	2566.3	397.7

Note. Values are $10^3$ env steps/s; MJ = MuJoCoUni backend, Motrix = MotrixSim backend.
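To read Table 2 as relative backend throughput, the short script below computes the MotrixSim-to-MuJoCoUni ratio per chip and task. The numbers are copied from the table; the dictionary layout and the helper name `motrix_over_mj` are choices made here for illustration.

```python
# Table 2 values in 10^3 env steps/s: chip -> task -> (MuJoCoUni, MotrixSim)
TABLE2 = {
    "A18 Pro":   {"Go2": (55.7, 122.9),   "G1": (28.4, 18.1),   "Hand": (183.9, 134.1)},
    "M5 Max":    {"Go2": (288.0, 797.8),  "G1": (178.8, 127.7), "Hand": (1118.4, 982.9)},
    "R9-8945HX": {"Go2": (246.2, 704.2),  "G1": (154.6, 113.6), "Hand": (434.1, 542.2)},
    "TR-9980X":  {"Go2": (915.9, 2662.7), "G1": (517.9, 410.4), "Hand": (1991.5, 2622.6)},
    "i7-11800H": {"Go2": (82.1, 162.0),   "G1": (34.7, 23.8),   "Hand": (176.8, 151.6)},
    "Xeon 8558": {"Go2": (1002.4, 847.2), "G1": (424.6, 379.5), "Hand": (2566.3, 397.7)},
}
TASKS = ("Go2", "G1", "Hand")


def motrix_over_mj(chip: str, task: str) -> float:
    """Ratio above 1.0 means the MotrixSim backend is faster for that cell."""
    mj, motrix = TABLE2[chip][task]
    return motrix / mj


# Number of chips (out of six) on which MotrixSim is the faster backend, per task.
motrix_faster = {
    task: sum(motrix_over_mj(chip, task) > 1.0 for chip in TABLE2)
    for task in TASKS
}

for chip in TABLE2:
    print(chip, {task: round(motrix_over_mj(chip, task), 2) for task in TASKS})
print(motrix_faster)
```

On these numbers, MotrixSim is faster on Go2 for five of the six chips, MuJoCoUni is faster on G1 for all six, and the Hand task is mixed.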

#4.3 Can CPU-Sim / GPU-Learn Improve End-to-End Efficiency?

Given sufficient CPU-side throughput for strictly synchronized PPO, the next question is whether heterogeneous organization translates into end-to-end gains as data dependencies become looser. Once the runtime decouples the learner from the collector, these more loosely coupled settings obtain $3\text{--}10\times$ improvements in end-to-end training efficiency across multiple robot control tasks.

**Figure 5:** End-to-end training efficiency on representative robot control tasks. Representative speedups: $3.3\times$ on G1 Flip, $8.4\times$ on G1 Walk Flat, and $11.0\times$ on G1 Motion Tracking.

**Figure 6:** Training-cycle placement ablation. Holosoma is the FastSAC codebase used here, and MjWarp is its MuJoCo Warp backend. The figure compares where simulation collection and learning are placed during one learner cycle.

**Figure 7:** To-real experiment overview across six real-robot tasks.

#4.4 Dexterous In-Hand Rotation as a Systems Stress Test

SharpaWaveHand in-hand rotation adds contact-rich evidence beyond locomotion and motion tracking. In this task, the CPU MuJoCo version trains better, and UniLab reaches stronger HORA teacher policies within a shorter wall-clock budget. The task uses a 22-DOF tactile hand to rotate a randomized free object and shows that UniLab supports dense simulation, stable learning, and different synchronization constraints in dexterous teacher training.

#4.5 Cross-Platform Evidence

Finally, we report Apple macOS, AMD ROCm, and Intel XPU results to show practical trainability outside a single CUDA-centric setup, without claiming absolute throughput parity with the main Linux/CUDA workstation. Cross-platform execution is a practical consequence of the UniLab interface design.

**Figure 8:** Cross-platform training overview on representative devices. The figure shows training curves and final performance on different platforms.

**Table 3:** Wall-clock training time (min.).
Device	FastSAC / G1 WBT	FastSAC / G1 Walk	FlashSAC / Go2 Joy.	PPO / G1 Flip
RTX 4090 (Baseline)	58.8	18.3	6.0	109.0
RTX 4090 + AMD 9950X3D	18.5	3.0	1.1	16.4
AMD 8060S + AMD AI MAX 395	33.6	9.4	4.2	19.6
M5 Max	75.0	18.8	4.5	16.8
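The speedups implied by Table 3 can be recomputed directly from the reported minutes. The sketch below takes the row labeled Baseline as the reference; the variable names are chosen here, and the values are copied from the table.

```python
TASKS = ["FastSAC / G1 WBT", "FastSAC / G1 Walk", "FlashSAC / Go2 Joy.", "PPO / G1 Flip"]

# Wall-clock training time in minutes, one list entry per task column.
MINUTES = {
    "RTX 4090 (Baseline)": [58.8, 18.3, 6.0, 109.0],
    "RTX 4090 + AMD 9950X3D": [18.5, 3.0, 1.1, 16.4],
    "AMD 8060S + AMD AI MAX 395": [33.6, 9.4, 4.2, 19.6],
    "M5 Max": [75.0, 18.8, 4.5, 16.8],
}


def speedups(device: str, reference: str = "RTX 4090 (Baseline)") -> dict:
    """Reference time divided by device time, per task (above 1.0 means faster)."""
    return {
        task: ref / t
        for task, ref, t in zip(TASKS, MINUTES[reference], MINUTES[device])
    }


same_workstation = speedups("RTX 4090 + AMD 9950X3D")
for task, factor in same_workstation.items():
    print(f"{task}: {factor:.1f}x")
```

For the RTX 4090 + AMD 9950X3D row this gives roughly 3.2×, 6.1×, 5.5×, and 6.6×, all inside the $3\text{--}10\times$ range quoted in the text.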

#5. Conclusion

This paper presented UniLab, a heterogeneous CPU-simulation / GPU-learning architecture for robot RL. By coordinating data movement, buffering, and synchronization through a unified runtime, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ across multiple robot embodiments, control workloads, and practical algorithms, while reducing dependence on the NVIDIA CUDA-based software stack and supporting Apple macOS, AMD ROCm, and Intel XPU backends. These results show that efficient training depends on high-throughput, well-coordinated simulation-learning execution, rather than requiring physics to reside on the GPU; UniLab therefore provides a systems counterexample showing that the design space for efficient training is broader than the current GPU-centric default suggests.

#6. Discussion

Our claim is not that GPU-resident simulation is obsolete. GPU simulation may remain preferable when simulator throughput is no longer the bottleneck or when larger accelerator-rich configurations are a better fit. UniLab broadens the design space for simulation-dominated robot control.

The speed of a GPU-centric stack comes from two coupled designs: simulation, rollout collection, and learning share a low-overhead execution path, while the physics backend is organized as GPU-friendly parallel computation. The former is a training-system organization principle; the latter is one hardware path for realizing it. This path is effective for regular, dense, and statically shaped computation, but dynamic contacts, sparse interactions, collision handling, and constraint solving can increase backend engineering pressure. Thus, this paper does not challenge the value of GPU simulators; it challenges the necessity claim that efficient robot RL training must use GPU-resident physics.

#7. Limitations

The main limitations follow from three assumptions. First, UniLab is most advantageous when training is simulation-dominated and simulation can be meaningfully decoupled from learning; on strictly synchronized pipelines or vision-based workloads, CPU/GPU decoupling may yield smaller gains. Second, our claim concerns end-to-end training efficiency in a controlled single-CPU/single-GPU setting, not absolute peak throughput at extreme scale. Third, the current implementation focuses on rigid-body robot control rather than deformable objects, soft bodies, or fluids. Future work should extend the same runtime analysis to vision-dominated tasks, larger systems, and non-rigid physics.

#Acknowledgments

We thank Apple and AMD for providing hardware platforms for development and evaluation, and for assisting with platform adaptation. We are also sincerely grateful to the mjlab team for open-sourcing their excellent work, whose engineering practices provided valuable reference for this project. We also thank early users of UniLab and the students in Tsinghua University's Spring 2026 Deep Reinforcement Learning course for their use and feedback.

#References


V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.
K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y. Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, et al. MuJoCo Playground. arXiv preprint arXiv:2502.08844, 2025.
S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. Chan, et al. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. Robotics: Science and Systems, 2025.
Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024.
J. Weng, M. Lin, S. Huang, B. Liu, D. Makoviichuk, V. Makoviychuk, Z. Liu, Y. Song, T. Luo, Y. Jiang, et al. EnvPool: A highly parallel reinforcement learning environment execution engine. NeurIPS, 35:22409–22421, 2022.
Z. Wu, E. Liang, M. Luo, S. Mika, J. E. Gonzalez, and I. Stoica. RLlib Flow: Distributed reinforcement learning is a dataflow problem. NeurIPS, 2021.
J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, Y. Su, H. Su, and J. Zhu. Tianshou: A highly modularized deep reinforcement learning library. JMLR, 23(267):1–6, 2022.
J. Suarez. PufferLib 2.0: Reinforcement learning at 1M steps/s. Reinforcement Learning Journal, 6:1378–1388, 2025.
I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
Y. Kim, H. Oh, J. Lee, J. Choi, G. Ji, M. Jung, D. Youm, and J. Hwangbo. Not only rewards but also constraints: Applications on legged robot locomotion. IEEE Transactions on Robotics, 40:2984–3003, 2024.
O. Pearce. Exploring utilization options of heterogeneous architectures for multi-physics simulations. Parallel Computing, 87:35–45, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, pages 1861–1870, 2018.
Y. Jia and J. Wu. MuJoCoUni: Persistent batched runtime primitives for MuJoCo. arXiv preprint arXiv:2605.24922, 2026.
Motphys Team. MotrixSim: A physics simulation engine for robotics and embodied AI, 2026.
C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax – a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021.
J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox. GPU-accelerated robotic simulation for distributed reinforcement learning. CoRL, pages 270–282, 2018.
E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. IROS, pages 5026–5033, 2012.
J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
G. B. Margolis and P. Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. CoRL, pages 22–31, 2023.
G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. IJRR, 43(4):572–587, 2024.
Z. Wang, Y. Jia, L. Shi, H. Wang, H. Zhao, X. Li, J. Zhou, J. Ma, and G. Zhou. Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot. IROS, pages 10770–10776, 2024.
T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbabu, C. Pan, et al. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025.
Z. Cao, L. Yan, Y. Zhang, S. Chen, J. Ma, T. Zhan, S. Fu, Y. Jia, C. Lu, and Y. Gao. HiWET: Hierarchical world-frame end-effector tracking for long-horizon humanoid loco-manipulation. arXiv preprint arXiv:2602.06341, 2026.
S. Bharthulwar, S. Tao, and H. Su. Staggered environment resets improve massively parallel on-policy reinforcement learning. NeurIPS, 38:133342–133375, 2026.
A. A. Shahid, Y. Narang, V. Petrone, E. Ferrentino, A. Handa, D. Fox, M. Pavone, and L. Roveda. Scaling population-based reinforcement learning with GPU accelerated simulation. arXiv preprint, 2024.
S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. ICML, pages 1587–1596, 2018.
Y. Seo, C. Sferrazza, H. Geng, M. Nauman, Z.-H. Yin, and P. Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025.
Y. Seo, C. Sferrazza, J. Chen, G. Shi, R. Duan, and P. Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes. arXiv preprint arXiv:2512.01996, 2025.
D. Kim, Y. Lee, M. Park, K. Kim, I. Nahendra, T. Seno, S. Min, D. Palenicek, F. Vogt, D. Kragic, et al. FlashSAC: Fast and stable off-policy reinforcement learning for high-dimensional robot control. arXiv preprint arXiv:2604.04539, 2026.
M. Luo, J. Yao, R. Liaw, E. Liang, and I. Stoica. IMPACT: Importance weighted asynchronous architectures with clipped target networks. arXiv preprint arXiv:1912.00167, 2019.
Google DeepMind. MuJoCo Warp (MJWarp), 2026.
K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel. mjlab: A Lightweight Framework for GPU-Accelerated Robot Learning. arXiv preprint arXiv:2601.22074, 2026.

#Appendix A. Off-Policy Replay Path Case Study

This section complements the system-attribution analysis with a detailed case study of the SAC replay-based execution path. Unless otherwise stated, all timeline statistics are computed from Perfetto traces collected on an A100 machine: one NVIDIA A100 80 GB PCIe GPU with driver 560.35.05 and CUDA 12.6, two Intel Xeon Gold 5320 CPUs with 104 logical CPU threads, and 188 GiB system memory.

#A.1 Baseline GPU-Cache SAC Path

We use SAC-A to denote the straightforward SAC baseline. It corresponds to the GPU-cache replay path before the sample-before-transfer pipeline. This baseline is already a heterogeneous design: a CPU collector process runs a CPU actor synchronized from learner weights, advances the batched environment, and writes transitions into shared CPU replay storage. The learner holds SAC actor and critic networks on the accelerator and periodically publishes updated actor weights back to the collector.

The remaining cost lies in the replay boundary. In the CUDA path, the learner maintains a device-side replay cache. When the learner samples, newly appended replay rows are lazily synchronized into this GPU cache, random indices are moved to the device, and the sampled batch is gathered from the cached replay tensors. Thus, replay-cache maintenance and random replay access are part of the learner's hot update path.

#A.2 Sample-Before-Transfer Replay Pipeline

UniLab moves the replay boundary from the replay buffer to the sampled batch. The collector still performs CPU actor inference, environment stepping, and replay insertion. Once the learner requests the next training batch, the collector samples rows from a replay snapshot on the CPU and packs them into one of two shared pack slots. On CUDA, these pack slots are registered as pinned host-memory sources for asynchronous H2D transfer. A learner-side background H2D submit thread then transfers the packed batch into the cold GPU batch slot while the learner consumes the current hot slot.
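The hot/cold slot rotation can be sketched without a GPU. In the standard-library illustration below, a single background worker samples and packs the next batch while the main thread consumes the current one. All names are hypothetical, and the list copy merely stands in for the asynchronous H2D transfer; a CUDA implementation would typically use pinned host tensors and non-blocking copies, which are not shown here.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_and_pack(replay, batch_size, rng):
    """Collector side: sample rows on the CPU and pack them into one contiguous batch."""
    indices = [rng.randrange(len(replay)) for _ in range(batch_size)]
    return [replay[i] for i in indices]


def prepare_cold_slot(replay, batch_size, rng):
    """Sample, pack, then 'transfer'; the copy is a stand-in for an async H2D copy."""
    packed = sample_and_pack(replay, batch_size, rng)
    return list(packed)


def train(replay, num_updates, batch_size, seed=0):
    rng = random.Random(seed)  # touched only by the single worker thread
    batch_means = []
    with ThreadPoolExecutor(max_workers=1) as worker:
        # Prefetch the first batch into the cold slot.
        cold = worker.submit(prepare_cold_slot, replay, batch_size, rng)
        for _ in range(num_updates):
            hot = cold.result()  # swap: the prefetched batch becomes the hot slot
            # Refill the cold slot in the background while the update runs.
            cold = worker.submit(prepare_cold_slot, replay, batch_size, rng)
            batch_means.append(sum(hot) / len(hot))  # stand-in for a learner update
        cold.result()  # drain the final prefetch before shutting down
    return batch_means


batch_means = train(list(range(1000)), num_updates=5, batch_size=32)
print(batch_means)
```

Only the sampled batch crosses the boundary in this sketch. The replay storage itself stays on the producer side, which is the property the sample-before-transfer design relies on.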

**Figure A.1:** Baseline SAC-A and optimized SAC learner-cycle timelines on A100. The optimized double-buffer path reduces cycle time from 211 ms to 136 ms, collector stall from 103 ms to below 1 ms, and resume gap from 12.3 ms to 2.9 ms.

#A.3 Trace-Based Attribution

We analyze A100 Perfetto traces for the baseline GPU-cache SAC path and the UniLab double-buffer path. These traces provide mechanism and timing evidence: they show where replay sampling, H2D transfer, learner updates, and weight publication occur.

**Figure A.2:** System-attribution summary for the optimized SAC trace. Panel A reports batching efficiency. Panels B–D summarize runtime components, simulation-learning overlap, and collector-side CPU actor-inference cost.

In the traced 500-iteration window, the double-buffer path reduces training time from 107.50 s to 70.58 s, a 34.34% reduction in wall-clock time. After dropping the first five cycles, the mean learner cycle decreases from 211.31 ms to 136.10 ms. With 2048 environment steps per learner cycle, this corresponds to an increase from 9.69k to 15.05k environment steps per second.
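The throughput and wall-clock figures above follow from simple arithmetic on the reported cycle times. The snippet below reproduces them, using variable names chosen here for readability.

```python
STEPS_PER_CYCLE = 2048

baseline_cycle_s = 211.31e-3   # mean learner cycle, baseline GPU-cache path
optimized_cycle_s = 136.10e-3  # mean learner cycle, double-buffer path

# Environment steps per second implied by each mean cycle time.
baseline_sps = STEPS_PER_CYCLE / baseline_cycle_s
optimized_sps = STEPS_PER_CYCLE / optimized_cycle_s

# Relative wall-clock reduction over the traced 500-iteration window.
baseline_wall_s, optimized_wall_s = 107.50, 70.58
wall_reduction = (baseline_wall_s - optimized_wall_s) / baseline_wall_s

print(f"{baseline_sps / 1e3:.2f}k -> {optimized_sps / 1e3:.2f}k env steps/s")
print(f"wall-clock reduction: {100 * wall_reduction:.2f}%")
```

Both derived quantities agree with the values quoted in the text: 9.69k and 15.05k steps per second, and a 34.34% reduction.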

The clearest change is on the replay hot path. In the baseline trace, learner/replay_sample takes 3.64 ms on average and includes lazy replay synchronization. In the UniLab trace, learner-side replay consumption is reduced to 0.23 ms on average. Replay preparation still exists, but it is moved out of the learner hot path: CPU packing takes 6.30 ms, and GPU H2D transfer takes 3.13 ms, while 99.50% of collector-active time overlaps with learner updates.

#A.4 Ablating the Path from GPU-Cache SAC to Sample-Before-Transfer

We run a SAC replay-path ablation on the same A100 machine. The four variants preserve SAC's objective and update equations; only the replay boundary changes. The variants form a controlled migration chain: C (GPU-cache compatibility control) → B (modern framework, GPU-cache) → A (sampled-batch transfer, synchronous) → Baseline (pinned double-buffer).

**Figure A.3:** C-to-baseline ablation for the SAC replay path. Wall-clock E2E bars are three-seed means with sample-standard-deviation error bars. Panel C reports learner-side replay-sample and boundary-wait statistics; Panel D reports peak CUDA reserved memory and the GPU-cache component.

Moving from B to A removes the GPU-cache component and reduces peak CUDA reserved memory from 2362 MB to 692 MB. Moving from A to the baseline keeps the low-memory CPU-resident replay design and changes the transfer mechanism: pinned shared pack slots, one-tick asynchronous H2D, and hot/cold GPU slots reduce learner-side replay consumption from 10.19 ms to 0.35 ms. Relative to C, the final baseline reduces wall time from 101.23 s to 85.04 s while also removing the GPU-cache footprint.

#A.5 Buffer and Communication Overhead

**Figure A.4:** SAC buffer and communication overhead. Panel A groups counted data-movement, weight-synchronization, and boundary-wait overhead by share of the mean learner cycle. Panel B reports an auxiliary replay-placement benchmark.

In the optimized trace, the counted data-movement, synchronization, and boundary-wait overhead total is 15.82 ms per cycle, or 11.62% of the 136.10 ms mean learner cycle. Data movement is the largest counted component at 10.07 ms per cycle, weight synchronization contributes 4.79 ms, and residual boundary waiting contributes 0.96 ms.
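The breakdown above can be recomputed from its three components. The sketch below uses only the reported numbers; the dictionary keys are illustrative labels.

```python
# Per-cycle overhead components reported for the optimized trace, in milliseconds.
overhead_ms = {
    "data_movement": 10.07,
    "weight_synchronization": 4.79,
    "boundary_wait": 0.96,
}
mean_learner_cycle_ms = 136.10

total_overhead_ms = sum(overhead_ms.values())
overhead_share = 100.0 * total_overhead_ms / mean_learner_cycle_ms

# Share of the mean learner cycle attributable to each counted component.
component_share = {
    name: 100.0 * value / mean_learner_cycle_ms for name, value in overhead_ms.items()
}

print(f"total: {total_overhead_ms:.2f} ms ({overhead_share:.2f}% of the cycle)")
for name, share in component_share.items():
    print(f"{name}: {share:.2f}%")
```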

#Appendix B. Domain Randomization Backends and Lifecycle

Domain randomization in UniLab is implemented as a task/backend contract rather than as an algorithm-level feature. A task-owned `DomainRandomizationProvider` samples the quantities that are meaningful for the workload, while the simulator backend advertises which physical overrides it can apply. The runtime mediator, `DomainRandomizationManager`, validates this contract, applies cold-start model variants before backend materialization, injects reset payloads into sparse environment resets, and schedules interval perturbations before physics stepping.

#B.1 Runtime Lifecycle

The important systems detail is that reset-time randomization is sparse: only the environments listed in `env_ids` receive a new state and a new randomization payload. Interval randomization is different: it is checked once per vectorized environment step, before the backend advances physics.

**Table A.1:** Domain-randomization lifecycle used by the current UniLab runtime.
Lifecycle	Trigger	Owner	Randomized State
Backend initialization	DR init hook before backend `materialize()`	Task provider builds init plan; backend materializes variants	Persistent model or geometry variants (e.g., object-scale via `GeomSizeOverride`)
Sparse reset	Environment creation and later reset of terminated `env_ids`	Task provider samples reset state; backend applies supported fields	Pose, velocity, commands, mass, COM, gravity, friction, actuator gains
Scheduled interval	Each vectorized `step`	Task provider builds interval plan; backend stages perturbation	Push forces and body-force perturbations
Observation construction	Every task observation update	Task code	Actor observation noise, history/bias terms
Evaluation and playback	Same environment contract as training	Training/evaluation entrypoint	Shared contract; deterministic runs disable relevant switches
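The difference between the sparse-reset and scheduled-interval rows can be sketched with a toy mediator. This is not the `DomainRandomizationManager` API; the class, its methods, and the friction range are hypothetical and chosen only to show which lifecycle touches which environments.

```python
import random


class SketchRandomizationManager:
    """Hypothetical stand-in for the runtime mediator (illustration only)."""

    def __init__(self, num_envs: int, push_interval_steps: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.friction = [1.0] * num_envs      # one value per environment
        self.push_interval_steps = push_interval_steps
        self.step_count = 0
        self.pushes_applied = 0

    def on_reset(self, env_ids: list) -> None:
        # Sparse reset: only the listed environments get a new payload.
        for env_id in env_ids:
            self.friction[env_id] = self.rng.uniform(0.5, 1.5)

    def before_physics_step(self) -> None:
        # Interval randomization: checked once per vectorized step.
        self.step_count += 1
        if self.step_count % self.push_interval_steps == 0:
            self.pushes_applied += 1          # stand-in for staging a push force


manager = SketchRandomizationManager(num_envs=4, push_interval_steps=3)
manager.on_reset(env_ids=[1, 3])
for _ in range(7):
    manager.before_physics_step()

print(manager.friction, manager.pushes_applied)
```

Environments 0 and 2 keep their previous friction because they were not listed in `env_ids`, while the interval check runs on every step regardless of which environments were reset.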

#B.2 Backend Implementation

MuJoCoUni. MuJoCoUni implements reset-time randomization through `BatchEnvPool.reset(env_ids, initial_state, randomization=None)`, which receives both the new physics state and an optional dictionary of model-field patches. Each payload has leading dimension `len(env_ids)`, so reset cost scales with the number of environments that actually terminate. Fields that affect MuJoCo derived constants are patched before the reset/forward path and refreshed with `mj_setConst`.
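The leading-dimension rule can be made concrete with a small payload builder. The layout below is an assumption for illustration: the keys `gravity` and `base_mass_delta` are taken from Table A.2, but the nested-list structure, the `qpos`/`qvel` dictionary, and the helper `build_reset_payload` are hypothetical and do not describe the actual MuJoCoUni payload format.

```python
def build_reset_payload(env_ids: list, nq: int, nv: int):
    """Build per-environment reset data whose leading dimension is len(env_ids)."""
    n = len(env_ids)
    initial_state = {
        "qpos": [[0.0] * nq for _ in range(n)],   # one generalized-position row per reset env
        "qvel": [[0.0] * nv for _ in range(n)],   # one generalized-velocity row per reset env
    }
    randomization = {
        "gravity": [[0.0, 0.0, -9.81] for _ in range(n)],
        "base_mass_delta": [0.0] * n,
    }
    return initial_state, randomization


# Only three of, say, 4096 environments terminated on this step.
terminated = [17, 230, 4001]
state, patches = build_reset_payload(terminated, nq=7, nv=6)

leading_dims = {name: len(value) for name, value in {**state, **patches}.items()}
print(leading_dims)
```

Every array is sized by the number of terminated environments rather than by the total environment count, which is why reset cost tracks terminations.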

MotrixSim. MotrixSim implements the same task/backend contract with MotrixSim-native override APIs. During `set_state`, the backend resets the selected data slice, clears staged body forces, applies init-time geometry-size overrides, applies supported reset randomization, writes the new DOF state, and runs forward kinematics. Friction, gravity, and actuator-gain randomization are conditional capabilities: they are enabled only when the loaded MotrixSim model exposes the corresponding override methods.

#B.3 Supported Randomization Families

**Table A.2:** Supported domain-randomization families and backend-specific limits.
Family	Lifecycle	MuJoCoUni	MotrixSim
Model/geometry variants	Init	Precompiled MjModel variants with per-env assignments	Per-env geometry-size overrides
Initial state & task conditions	Reset	Backend receives sampled qpos/qvel	Same contract after DOF layout conversion
Base/body mass	Reset	base_mass_delta and full body_mass	Base-link mass delta and full link-mass override
Gravity	Reset	gravity payload	Conditional gravity override
Contact friction	Reset	Full geom_friction payload	Conditional collision-geom friction overrides
Actuator gains	Reset	kp and kd payloads	Conditional per-actuator Kp/damping overrides
External perturbations	Interval	Push and body-force via xfrc_applied	Push forces through link external force
Observation noise	Obs step	Task-side NumPy noise	Same task-side path

#B.4 Implications for Cross-Backend Experiments

The shared contract lets a task express randomization once, but the effective randomization set is still backend-dependent. A randomization item should be interpreted as active only when the task configuration enables it and the selected backend advertises support for the corresponding field. MuJoCoUni exposes a wider reset-field surface for inertial fields, whereas MotrixSim can match many common locomotion and manipulation settings through link, geom, gravity, actuator, and external-force override APIs when the loaded model supports them.
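The activation rule stated above amounts to a set intersection between what the task enables and what the backend advertises. The sketch below assumes simple string capability names, which are hypothetical and not the identifiers used by UniLab.

```python
def effective_randomization(task_enabled: set, backend_supported: set):
    """Return (active, skipped) randomization items for one task/backend pair."""
    active = sorted(task_enabled & backend_supported)     # enabled and supported
    skipped = sorted(task_enabled - backend_supported)    # enabled but not supported
    return active, skipped


# Hypothetical task configuration and backend capability set.
task_enabled = {"gravity", "friction", "actuator_gains", "push"}
backend_supported = {"gravity", "push", "base_mass"}

active, skipped = effective_randomization(task_enabled, backend_supported)
print("active:", active)
print("skipped:", skipped)
```

Reporting the skipped items alongside the active ones makes it explicit when two backends ran the same task configuration with different effective randomization sets.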

#Appendix C. Task and Algorithm Details

Detailed per-task specifications (observation space, action space, reward weights, domain randomization) and per-algorithm hyperparameter tables are provided in the full paper PDF. This appendix covers locomotion, motion tracking, manipulation-locomotion, and dexterous-hand task families, as well as PPO, APPO, and SAC algorithm configurations.

PPO task grid — **Figure C.1:** PPO training curves across representative tasks.

APPO task grid — **Figure C.2:** APPO training curves across representative tasks.

SAC task grid — **Figure C.3:** SAC training curves across representative tasks.

#Citation

UniLab

@article{jia2026unilab,
  title         = {UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms},
  author        = {Yufei Jia and Zhanxiang Cao and Mingrui Yu and Heng Zhang and Shenyu Chen and Dixuan Jiang and Meng Li and Xiaofan Li and Yiyang Liu and Junzhe Wu and Zheng Li and XiLin Fang and Ting-Yu Tsui and Shengcheng Fu and Haoyang Li and Anqi Wang and Zifan Wang and Dongjie Zhu and Chenyu Cao and Zhenbiao Huang and Ziang Zheng and Jie Lu and Xin Ma and Zhengyang Wei and Xiang Zhao and Tianyue Zhan and Ye He and Yuxiang Chen and Yizhou Jiang and Yue Li and Haizhou Ge and Yuhang Dong and Fan Jia and Ziheng Zhang and Meng Zhang and Xiwa Deng and Zhixing Chen and Hanyang Shao and Chenxin Dong and Yixuan Li and Yizhi Chen and Bokui Chen and Kaifeng Zhang and Hanqing Cui and Yusen Qin and Ruqi Huang and Lei Han and Tiancai Wang and Xiang Li and Yue Gao and Guyue Zhou},
  journal       = {arXiv preprint arXiv:2605.30313},
  year          = {2026},
  url           = {https://arxiv.org/abs/2605.30313}
}

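UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms

Keywords: Robot Reinforcement Learning, Systems, Heterogeneous Training

**Figure 1:** Representative robot-control tasks in UniLab; "Uni" means unified cross-platform training. Teaser image rendered with MotrixSim.

#Abstract

Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption. Our view is that, in simulation-dominated robot control, the essential question is not which processor runs physics, but whether simulation throughput, policy learning, and runtime synchronization form an efficient end-to-end loop.

We present UniLab, a heterogeneous CPU-simulation / GPU-learning architecture that decouples CPU-parallel simulation from GPU policy updates through a unified runtime for data movement, buffering, and synchronization. UniLab is implemented as a complete and extensible training system using MuJoCoUni and MotrixSim CPU-batched physics backends, supporting PPO, FastSAC, FlashSAC, and APPO. On representative simulation-based robot control tasks, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ under the same hardware configuration, while reducing dependence on the NVIDIA CUDA-based software stack and supporting cross-platform execution on the Apple macOS platform and the AMD ROCm and Intel XPU accelerator backends.

The results show that GPU simulation is one effective path to efficient training but not a necessary condition, which widens the set of practical system paths available for robot RL training. Project page: https://unilabsim.github.io.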

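#1. Introduction

Training infrastructure has become a first-order factor in the research efficiency of simulation-based robot RL: faster training lowers the wall-clock cost of each experiment, shortens the system and algorithm iteration cycle, and widens the range of tasks that can be explored under a realistic hardware budget. The mainstream answer in recent years has been clear: place physics simulation, rollout collection, and learning on a GPU-centric execution path wherever possible. Systems such as Isaac Gym, Isaac Lab, MuJoCo Playground, mjlab, ManiSkill3, and Genesis show that large-scale GPU-resident environment parallelism can substantially accelerate robot-control training. This success has shaped the community's current default assumption: efficient training should be organized around GPU-resident physics, which in turn binds it more tightly to a narrow set of GPU-resident software environments.

Robot RL training, however, is a closed-loop system composed of data generation, policy updates, and synchronization constraints, not merely a simulator benchmark. In simulation-dominated tasks, end-to-end efficiency depends on simulation throughput, learner utilization, collector/learner synchronization, data-movement and buffering overhead, and whether hardware is allocated to the stages that actually limit wall-clock time. Whether physics runs on the GPU is therefore only one design choice within a larger system-organization problem.

High-throughput environment execution paths also exist outside GPU-resident physics. General-purpose RL systems already make wide use of CPU-side vectorized or batched environment execution, and robot RL has precedents for CPU-distributed or CPU-parallel simulation. Algorithmic data dependencies further shape this organization: PPO's on-policy updates retain the strongest rollout/update synchronization constraint; APPO allows sampling and learning to overlap while staying near on-policy; and off-policy methods such as FastSAC and FlashSAC further relax each update's dependence on trajectories from the latest policy. The systems question this paper addresses is whether CPU-side batched rigid-body simulation, GPU-side policy learning, and the runtime coordination between them can form an efficient end-to-end training path.

This paper studies that question directly: in simulation-based robot-control training, must efficient training depend on GPU-resident simulation? Our thesis is that, in simulation-dominated robot-control training, the key requirement is high-throughput, well-coordinated simulation-learning execution, not GPU-resident simulation itself.

We present UniLab, a heterogeneous CPU-simulation / GPU-learning training architecture. The CPU side runs parallel batched rigid-body simulation and data generation through the MuJoCoUni and MotrixSim backends, the GPU side handles policy and value learning, and a unified runtime coordinates data movement, buffering, and synchronization. UniLab is a way of organizing a training system rather than a new policy-optimization algorithm; its implementation is a complete and extensible training system in which a single framework supports PPO, FastSAC, FlashSAC, and APPO.

On representative robot-control simulation benchmarks, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ on the same single-GPU / single-CPU workstation, while reducing dependence on the NVIDIA CUDA software stack and supporting the Apple macOS, AMD ROCm, and Intel XPU backends. This paper makes three contributions:

Systems perspective. We reframe efficient robot RL training as a problem of organizing the simulation-learning loop, rather than as a necessary consequence of GPU-resident physics.
Heterogeneous training architecture. We present UniLab, which connects CPU-batched physics backends, a GPU learner, data buffering, and parameter synchronization through a unified runtime, and supports PPO, FastSAC, FlashSAC, and APPO in a single framework.
End-to-end evidence. We demonstrate $3\text{--}10\times$ wall-clock gains across multiple robot embodiments, control workloads, and practical algorithms, and provide evidence of execution across macOS, ROCm, and XPU.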

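#2. Related Work

#2.1 GPU-Resident Robot Learning

**Table 1:** Representative robot RL training systems.
System	Physics Engine	Batched Execution	Coupling
`IsaacGym`	PhysX	GPU-C	GPU-sync
`IsaacLab`	PhysX	GPU-C	GPU-sync
`Genesis`	Taichi	GPU-C/M/R	GPU-sync
`MJP`	MJX	GPU-C	GPU-sync
`MjLab`	MJWarp	GPU-C	GPU-sync
UniLab	MJU/Mtx	CPU	H-async/sync

Note: GPU-C/M/R: GPU-batched physics based on CUDA/Metal/ROCm. GPU-sync: synchronous GPU simulation-learning; H-async/sync: CPU simulation + GPU learning.

In recent years, the mainstream systems path to efficient robot RL training has been to place physics simulation, rollout collection, and learning on a GPU-centric execution path. MuJoCo provides a widely used foundation for control simulation, while GPU-accelerated stacks such as Isaac Gym, Isaac Lab, MuJoCo Playground, mjlab, ManiSkill3, and Genesis have made large-scale environment parallelism routine practice.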

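#2.2 Systems Lessons from GPU Simulation

The core lesson of GPU-resident systems is to organize fast physics execution together with tightly coupled rollout collection and learner updates. For on-policy methods such as PPO, this organization matches the synchronous batched rollout/update loop well. This paper inherits that lesson but separates the organizing principle of the training system from the specific hardware path: efficient training needs low-overhead data generation, learning, and synchronization, whereas GPU kernels are best suited to regular, dense, static-shape execution; dynamic contact sets, sparse interactions, collision detection, contact solving, and constraint handling put pressure on that execution model.

#2.3 CPU-Parallel Environment Execution

High-throughput environment execution paths also exist outside GPU-resident physics. In general-purpose RL, systems such as EnvPool, RLlib, Tianshou, and PufferLib already treat CPU-side vectorized, batched, or parallel rollout collection as a core component. Robot RL also has precedents for CPU-distributed or CPU-parallel simulation. These works show that CPU environment parallelism has always been feasible; what UniLab further tests is whether, under the same hardware setup, modern CPU-batched simulation and a GPU learner can form an efficient end-to-end training path through a low-overhead runtime.

#2.4 Replay-Based Acceleration for Robot Control

Algorithmic data dependencies further shape system organization. PPO is the practical default in many large-scale robot training workloads, but its on-policy updates retain a strong synchronization constraint between rollout generation and learner updates. Replay-based methods such as SAC and TD3 can reuse past experience, relaxing at the algorithmic level each update's dependence on trajectories from the latest policy; FastTD3, FastSAC, and FlashSAC further demonstrate the effectiveness of this direction in high-dimensional robot control. UniLab studies the complementary systems question: once data dependencies are relaxed, how can CPU simulation and GPU learning obtain end-to-end wall-clock gains through runtime coordination?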

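#3. UniLab Architecture

This section presents the system design of UniLab. The architecture organizes CPU-side batched rigid-body simulation, GPU-side policy/value learning, and the unified runtime's data-path coordination as a single end-to-end training loop.

**Figure 2:** UniLab system architecture. The figure shows the data, scheduling, and parameter-synchronization paths among the CPU-side batched physics backends, the unified runtime, and the GPU learner.

#3.1 Design Goals and Requirements

The design goal is to improve the efficiency of the complete simulation-learning loop without relying on GPU-resident simulation. UniLab organizes the system around hardware characteristics: the CPU generates simulation data at scale, the GPU executes dense learning updates, and the runtime reduces the cost of coordinating the two sides. This goal yields three requirements:

CPU-side simulation throughput. CPU-side batched rigid-body simulation must provide enough throughput to continuously generate the data required by the workloads in this paper.

Non-blocking GPU learning. The GPU learner should consume experience continuously through buffered data rather than waiting for long periods on rollout generation.

Bounded runtime overhead. The overhead of data movement, buffering, and parameter synchronization must be low enough to keep the heterogeneous split from degrading into a blocking handoff.

#3.2 Execution Architecture

The system consists of three parts: CPU workers generate trajectories or transitions, the GPU learner performs policy and value updates, and the unified runtime coordinates data movement, buffering, scheduling, and parameter synchronization.

Collection-update timing and overlap. UniLab supports both synchronous and loosely coupled sampling-update timing within the same architecture. Standard PPO uses a synchronous rollout/update loop. The APPO implementation follows an asynchronous on-policy form: the collector writes fixed-length rollouts into a shared ring buffer while continuing to advance the next rollout, and the learner consumes available rollouts and performs V-trace correction and PPO-style updates on the GPU. FastSAC and FlashSAC use replay-based timing: the collector writes transition batches into a shared replay buffer, and the learner performs multiple updates from device batches.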
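The APPO data path can be illustrated with a toy rollout ring. This sketch is single-threaded and makes an assumption the text does not state: when the ring is full, the oldest rollout is overwritten. The class `RolloutRing` and its methods are hypothetical, and the V-trace correction itself is not implemented here; the sketch only shows the policy-version lag that such a correction would account for.

```python
from collections import deque


class RolloutRing:
    """Hypothetical fixed-capacity rollout ring; the oldest entry is dropped when full."""

    def __init__(self, capacity: int):
        self._slots = deque(maxlen=capacity)

    def write(self, rollout: list, policy_version: int) -> None:
        # The collector tags each rollout with the policy version that produced it.
        self._slots.append((policy_version, rollout))

    def drain(self) -> list:
        # The learner takes every rollout that is currently available.
        items = list(self._slots)
        self._slots.clear()
        return items


ring = RolloutRing(capacity=3)
for version in range(5):
    ring.write(rollout=[version] * 2, policy_version=version)

available = ring.drain()
learner_version = 5
# Lag between the learner's policy and the policy that generated each rollout.
staleness = [learner_version - version for version, _ in available]
print([version for version, _ in available], staleness)
```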

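Runtime abstraction. The unified runtime lets synchronous and loosely coupled execution share the same system stack, and connects robot assets, task configurations, simulation backends, and learning algorithms through explicit interfaces.

#3.3 CPU Physics Backends and Task Interface

Batched CPU physics. UniLab achieves CPU-side throughput through backend-native batched environment execution; CPU workers advance environments in parallel at batch granularity and generate trajectories or transitions.

Backend contract. The current system integrates two practical CPU-side simulation backends, MuJoCoUni and MotrixSim, under a shared runtime contract.

Task and randomization interface. The contract covers task state, actions, observation-related data, reset and interval randomization hooks, terrain context, and playback capability. This design separates physics semantics from training throughput; the same learner binding can also be mapped to macOS, ROCm, and XPU.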

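#4. Experiments

This section evaluates three questions: whether CPU simulation provides sufficient throughput, whether CPU simulation / GPU learning improves end-to-end wall-clock efficiency, and whether the results remain robust across task families and algorithms.

#4.1 Experimental Setup

The default comparison uses the same hardware configuration on the main Linux platform: one NVIDIA RTX 4090 GPU, one AMD Ryzen 9 9950X3D CPU, and 64 GB of 4800 MT/s memory. Tasks cover locomotion, motion tracking, manipulation, and manipulation-locomotion, with quadruped, wheel-legged, humanoid, and dexterous-hand embodiments. Algorithms are organized by synchronization constraint: PPO (strictly synchronous), APPO (overlappable near-on-policy), and FastSAC/FlashSAC (replay-based producer-consumer).

#4.2 Can CPU Simulation Throughput Meet the Needs of Robot RL?

Under common robot RL training settings, CPU physics does not necessarily deliver lower throughput than GPU simulation; in workloads such as complex contact and dexterous-hand manipulation, the relative advantage of CPU-batched simulation is more pronounced. Batched CPU simulation can provide the simulator-side capacity that heterogeneous execution requires.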

**Table 2:** CPU environment-stepping throughput ($10^3$ steps/s), by task and chip.
Chip	Go2 (MJ)	Go2 (Motrix)	G1 (MJ)	G1 (Motrix)	Dexterous Hand (MJ)	Dexterous Hand (Motrix)
A18 Pro	55.7	122.9	28.4	18.1	183.9	134.1
M5 Max	288.0	797.8	178.8	127.7	1118.4	982.9
R9-8945HX	246.2	704.2	154.6	113.6	434.1	542.2
TR-9980X	915.9	2662.7	517.9	410.4	1991.5	2622.6
i7-11800H	82.1	162.0	34.7	23.8	176.8	151.6
Xeon 8558	1002.4	847.2	424.6	379.5	2566.3	397.7

Note: values are in units of $10^3$ environment steps per second; MJ = MuJoCoUni backend, Motrix = MotrixSim backend.

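#4.3 Can CPU Simulation / GPU Learning Improve End-to-End Efficiency?

Once CPU-side throughput is sufficient to sustain strictly synchronous PPO, the key question becomes whether the heterogeneous organization can translate into end-to-end gains under more loosely coupled data dependencies. With the runtime decoupling the learner from the collector, these more loosely coupled settings achieve $3\text{--}10\times$ improvements in end-to-end training efficiency on multiple robot-control tasks.

**Figure 6:** Training-cycle placement ablation. Holosoma is the FastSAC codebase used in this paper, and MjWarp is its MuJoCo Warp backend. The figure compares where simulation collection and learning execute within one learner cycle.

#4.4 Dexterous In-Hand Rotation as a System Stress Test

SharpaWaveHand in-hand rotation adds contact-rich evidence beyond locomotion and motion tracking. On this task, the CPU MuJoCo version trains better, and UniLab also obtains a stronger HORA teacher in less wall-clock time. The task uses a 22-DOF tactile hand to rotate randomized free objects, demonstrating UniLab's support for high-density simulation, stable learning, and different synchronization constraints in dexterous-hand teacher training.

#4.5 Cross-Platform Validation

Finally, we report results on Apple macOS, AMD ROCm, and Intel XPU to show that this heterogeneous organization remains practically trainable outside a single CUDA-centric setup. Cross-platform execution is a practical consequence of UniLab's interface design.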

**Table 3:** Wall-clock training time (minutes).
Device	FastSAC / G1 WBT	FastSAC / G1 Walk	FlashSAC / Go2 Joy.	PPO / G1 Flip
RTX 4090 (baseline)	58.8	18.3	6.0	109.0
RTX 4090 + AMD 9950X3D	18.5	3.0	1.1	16.4
AMD 8060S + AMD AI MAX 395	33.6	9.4	4.2	19.6
M5 Max	75.0	18.8	4.5	16.8
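The speedups implied by Table 3 can be computed directly. The sketch below assumes that the first row is the GPU-resident baseline and that the second row is the UniLab heterogeneous configuration on the same workstation; the variable names are illustrative.

```python
tasks = ["FastSAC / G1 WBT", "FastSAC / G1 Walk", "FlashSAC / Go2 Joy.", "PPO / G1 Flip"]

# Wall-clock training time in minutes, copied from Table 3.
baseline_minutes = [58.8, 18.3, 6.0, 109.0]        # RTX 4090 (baseline)
heterogeneous_minutes = [18.5, 3.0, 1.1, 16.4]     # RTX 4090 + AMD 9950X3D

# Speedup = baseline time divided by heterogeneous time, per task.
speedups = {
    task: base / hetero
    for task, base, hetero in zip(tasks, baseline_minutes, heterogeneous_minutes)
}

for task, speedup in speedups.items():
    print(f"{task}: {speedup:.2f}x")
```

Under that reading of the rows, every per-task ratio falls inside the $3\text{--}10\times$ range reported in the text.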

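#5. Conclusion

This paper presents UniLab, a heterogeneous CPU-simulation / GPU-learning architecture for robot RL training. By coordinating data movement, buffering, and synchronization through a unified runtime, UniLab improves end-to-end training efficiency by $3\text{--}10\times$ across multiple robot embodiments, control workloads, and practical algorithms, while reducing dependence on the NVIDIA CUDA software stack and supporting the Apple macOS, AMD ROCm, and Intel XPU backends. The results show that efficient training depends on high-throughput, well-coordinated simulation-learning execution and does not require physics to reside on the GPU. UniLab therefore constitutes a systematic counterexample, showing that the design space for efficient training is wider than current GPU-centric default practice suggests.

#6. Discussion

This paper does not claim that GPU-resident simulation is obsolete; when simulator throughput is no longer the dominant bottleneck, or when larger accelerator-dense configurations are more appropriate, GPU simulation may still be the better choice. UniLab only widens the design space for simulation-dominated robot control.

The speed of GPU-centric stacks comes from two cooperating design decisions: simulation, rollout collection, and learning share a low-overhead execution path, and the physics backend is organized as massively parallel computation suited to the GPU. The former is an organizing principle for training systems; the latter is only one hardware path for realizing it. What this paper challenges is therefore not the value of GPU simulators, but the necessity claim that "efficient robot RL training must use GPU-resident physics."

#7. Limitations

The limitations of the current work stem mainly from three assumptions. First, UniLab is best suited to simulation-dominated training in which simulation and learning can be effectively decoupled; in strictly synchronous pipelines or vision-dominated tasks, CPU/GPU decoupling may not hide the dominant overhead. Second, this paper evaluates end-to-end training efficiency in a controlled single-CPU / single-GPU setting, not absolute peak throughput at extreme parallel scale. Third, the current implementation mainly covers rigid-body robot control and does not address deformable objects, soft bodies, or fluids. Future work should extend the same runtime analysis to vision-dominated tasks, larger-scale systems, and non-rigid physics.

#Acknowledgments

We thank Apple and AMD for providing hardware platforms for development and evaluation and for assisting with platform adaptation. We also sincerely thank the mjlab team for open-sourcing their excellent work, whose engineering practices provided a valuable reference for this project. We thank UniLab's early users and the students of the Spring 2026 Deep Reinforcement Learning course at Tsinghua University for their use and feedback.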

#References


V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.
K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y. Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, et al. MuJoCo Playground. arXiv preprint arXiv:2502.08844, 2025.
S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. Chan, et al. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. Robotics: Science and Systems, 2025.
Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024.
J. Weng, M. Lin, S. Huang, B. Liu, D. Makoviichuk, V. Makoviychuk, Z. Liu, Y. Song, T. Luo, Y. Jiang, et al. EnvPool: A highly parallel reinforcement learning environment execution engine. NeurIPS, 35:22409–22421, 2022.
Z. Wu, E. Liang, M. Luo, S. Mika, J. E. Gonzalez, and I. Stoica. RLlib Flow: Distributed reinforcement learning is a dataflow problem. NeurIPS, 2021.
J. Weng, H. Chen, D. Yan, K. You, A. Duburcq, M. Zhang, Y. Su, H. Su, and J. Zhu. Tianshou: A highly modularized deep reinforcement learning library. JMLR, 23(267):1–6, 2022.
J. Suarez. PufferLib 2.0: Reinforcement learning at 1M steps/s. Reinforcement Learning Journal, 6:1378–1388, 2025.
I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
Y. Kim, H. Oh, J. Lee, J. Choi, G. Ji, M. Jung, D. Youm, and J. Hwangbo. Not only rewards but also constraints: Applications on legged robot locomotion. IEEE Transactions on Robotics, 40:2984–3003, 2024.
O. Pearce. Exploring utilization options of heterogeneous architectures for multi-physics simulations. Parallel Computing, 87:35–45, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, pages 1861–1870, 2018.
Y. Jia and J. Wu. MuJoCoUni: Persistent batched runtime primitives for MuJoCo. arXiv preprint arXiv:2605.24922, 2026.
Motphys Team. MotrixSim: A physics simulation engine for robotics and embodied AI, 2026.
C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax – a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021.
J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox. GPU-accelerated robotic simulation for distributed reinforcement learning. CoRL, pages 270–282, 2018.
E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. IROS, pages 5026–5033, 2012.
J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
G. B. Margolis and P. Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. CoRL, pages 22–31, 2023.
G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. IJRR, 43(4):572–587, 2024.
Z. Wang, Y. Jia, L. Shi, H. Wang, H. Zhao, X. Li, J. Zhou, J. Ma, and G. Zhou. Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot. IROS, pages 10770–10776, 2024.
T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, et al. ASAP: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025.
Z. Cao, L. Yan, Y. Zhang, S. Chen, J. Ma, T. Zhan, S. Fu, Y. Jia, C. Lu, and Y. Gao. HiWET: Hierarchical world-frame end-effector tracking for long-horizon humanoid loco-manipulation. arXiv preprint arXiv:2602.06341, 2026.
S. Bharthulwar, S. Tao, and H. Su. Staggered environment resets improve massively parallel on-policy reinforcement learning. NeurIPS, 38:133342–133375, 2026.
A. A. Shahid, Y. Narang, V. Petrone, E. Ferrentino, A. Handa, D. Fox, M. Pavone, and L. Roveda. Scaling population-based reinforcement learning with GPU accelerated simulation. arXiv preprint, 2024.
S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. ICML, pages 1587–1596, 2018.
Y. Seo, C. Sferrazza, H. Geng, M. Nauman, Z.-H. Yin, and P. Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025.
Y. Seo, C. Sferrazza, J. Chen, G. Shi, R. Duan, and P. Abbeel. Learning sim-to-real humanoid locomotion in 15 minutes. arXiv preprint arXiv:2512.01996, 2025.
D. Kim, Y. Lee, M. Park, K. Kim, I. Nahendra, T. Seno, S. Min, D. Palenicek, F. Vogt, D. Kragic, et al. FlashSAC: Fast and stable off-policy reinforcement learning for high-dimensional robot control. arXiv preprint arXiv:2604.04539, 2026.
M. Luo, J. Yao, R. Liaw, E. Liang, and I. Stoica. IMPACT: Importance weighted asynchronous architectures with clipped target networks. arXiv preprint arXiv:1912.00167, 2019.
Google DeepMind. MuJoCo Warp (MJWarp), 2026.
K. Zakka, Q. Liao, B. Yi, L. Le Lay, K. Sreenath, and P. Abbeel. mjlab: A Lightweight Framework for GPU-Accelerated Robot Learning. arXiv preprint arXiv:2601.22074, 2026.
