MuJoCoUni: Persistent Batched Runtime Primitives for MuJoCo
We present MuJoCoUni, a downstream MuJoCo distribution for online robot learning and batched physics evaluation. Alongside the open-loop batched trajectory generation already provided by upstream mujoco.rollout, MuJoCoUni supplies runtime primitives for stateful environment execution. Its core object, BatchEnvPool, is a C++/pybind11 executor that owns per-environment mjModel copies, per-thread mjData workers, and an internal thread pool. It provides final-state-only short stepping, sparse reset, reset-lifecycle domain randomization, batched sensor forward evaluation without advancing dynamics, and batched Jacobian and height-field queries. The implementation is confined to the Python binding layer; MuJoCo's solver, contact model, integrator, and core source tree retain upstream semantics. This report describes the BatchEnvPool API, implementation boundary, relationship to rollout, and the validation and benchmark scripts shipped with mujoco-uni (pip install mujoco-uni).
1. Introduction
Robot-learning systems increasingly place the physics simulator inside the training loop. The runtime sends batched controls, advances a short time window, reads sensors and task state, and resets only terminated environments. MuJoCo already provides mature XML/MJB assets, sensors, contact solving, and debugging tools. When these fine-grained operations run at high frequency, however, interface overhead, object lifetime, and output shape directly affect training efficiency.
GPU-resident simulators and GPU-oriented MuJoCo backends are important paths for efficient training. When a task also needs upstream CPU MuJoCo behavior for models, sensors, contact or constraint handling, or debugging, a CPU-side batched runtime provides a complementary route.
Upstream MuJoCo already provides batched stepping through the official mujoco.rollout interface. It uses a C++ thread pool to run open-loop mj_step from many initial states and returns full state and sensor trajectories. Importantly, the persistence in rollout is limited to optional thread-pool reuse; environment models, data, state updates, reset semantics, and randomization lifecycles remain external to the call.
Online robot RL also needs an environment-runtime interface. The runtime should preserve environments and model variants across calls, return only the final state after short stepping windows, and apply sparse reset with domain randomization for terminated environments. Observation and control computation further need batched sensor forward passes, site Jacobians, and local terrain-height queries without advancing dynamics.
MuJoCoUni is a lightweight downstream distribution of MuJoCo with additions concentrated in the Python binding layer. Its core object, BatchEnvPool, creates per-environment mjModel copies, per-thread mjData workers, and an internal thread pool. It exposes step, forward, reset, compute_site_jacobians, and sample_hfield_height; MuJoCo's physics kernel and solver are unchanged.
The contribution of this report is engineering-oriented. We describe the complementary relationship between MuJoCoUni and upstream rollout, present the persistent environment pool and reset/forward/query primitives, and summarize the repository scripts for numerical parity, field-patching tests, and micro-benchmarks.
2. Related Work
2.1 Upstream MuJoCo Batching
The closest interface to MuJoCoUni is MuJoCo's upstream mujoco.rollout module. rollout generates open-loop trajectories from a batch of initial states and control sequences, supports single-threaded or thread-pool execution, and returns state and sensor arrays with shape nbatch × nstep × dim. The center of the rollout abstraction is "generate a full trajectory from input tensors"; the center of the MuJoCoUni abstraction is "maintain a repeatedly interactive environment pool."
rollout fits full-trajectory tasks such as planning, system identification, and trajectory optimization. BatchEnvPool is complementary when tasks need per-environment models to persist across calls, short steps to return only final states, sparse reset-time patches, or batched current-state queries.
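To make the difference in output shape concrete, the arithmetic below compares the float64 footprint of a full trajectory (nbatch × nstep × dim) with a final-state-only return (nbatch × dim). The batch size, window length, and state dimension are illustrative assumptions, not values from this report.

```python
# Illustrative sizes only; none of these values come from the report.
NBATCH = 4096   # environments
NSTEP = 10      # physics steps per control window
NSTATE = 37     # hypothetical state dimension
BYTES_PER_FLOAT64 = 8


def trajectory_bytes(nbatch, nstep, dim):
    """Footprint of a full (nbatch, nstep, dim) float64 trajectory."""
    return nbatch * nstep * dim * BYTES_PER_FLOAT64


def final_state_bytes(nbatch, dim):
    """Footprint of a final-state-only (nbatch, dim) float64 array."""
    return nbatch * dim * BYTES_PER_FLOAT64


full = trajectory_bytes(NBATCH, NSTEP, NSTATE)
final = final_state_bytes(NBATCH, NSTATE)
print(full, final, full // final)
```

Under these assumptions the final-state return is smaller by exactly the window length, which is the saving that matters when only the last state of each short window is consumed.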
2.2 Vectorized Environment Runtimes
Vectorized environment runtimes organize many environments behind one interface and are a common engineering layer in RL systems. EnvPool demonstrates the value of moving environment execution into a high-performance C++ runtime, and robot benchmarks such as ManiSkill expose batched task interfaces. MotrixSim shows a systems route that combines CPU-parallel simulation with reinforcement-learning algorithms for robot policy training. MuJoCoUni occupies a lower-level position: it extends the MuJoCo binding layer for systems that need the standard mjModel workflow, persistent model pools, reset-time domain randomization, and batched physics queries.
2.3 GPU-Resident Physics
Brax implements a vectorizable and differentiable physics kernel in JAX; MJX maps a subset of MuJoCo to JAX; Isaac Gym and Isaac Lab provide NVIDIA GPU-resident simulation through PhysX; Genesis and MuJoCo Warp also target GPU-side physics execution. These systems can provide high throughput at large parallel scales, but GPU paths typically require models, contact and constraint handling, and data layout to fit an accelerator-friendly execution model.
MuJoCoUni takes a complementary route. It preserves MuJoCo CPU physics semantics and concentrates batched execution plus common robot-task queries in the C++ binding layer. It does not claim to replace GPU-resident simulation; it is a CPU-batched backend for MuJoCo workloads where feature coverage matters more than accelerator residency.
2.4 Domain Randomization
Domain randomization is a basic technique for sim-to-real training and robust policy search. Standard MuJoCo Python workflows typically copy or mutate mjModel fields and call mj_setConst when required. MuJoCoUni moves common field patches into BatchEnvPool.reset, so sparse reset can handle both state reset and per-environment randomization.
2.5 Evolutionary and Optimization Workloads
Evolutionary computing, neuroevolution, and model search also rely on large numbers of physics evaluations. MuJoCoUni's persistent model pools, model-variant initialization, and final-state return semantics fit workloads that evaluate many candidate bodies or controllers in parallel.
3. System Design and API
3.1 Design Boundary
MuJoCoUni has a narrow design boundary: it adds a batched runtime inside the MuJoCo Python package without changing the physics kernel. Throughput improvements come from object lifetime, thread scheduling, and batched interfaces rather than from reducing the MuJoCo physics feature set. The core additions are batch_env.cc and batch_env.py.
3.2 Pool Construction
BatchEnvPool(model, *, nbatch, nthread=None) accepts either one MjModel or a compatible model sequence. The constructor creates one model copy per environment with mj_copyModel and one mjData per worker thread. When nthread > 0, an internal thread pool assigns chunks of environment indices to workers.
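The report does not specify how environment indices are chunked across workers. As one plausible scheme, the sketch below splits the index range into contiguous, near-equal chunks; the function name partition_env_indices and the chunking policy are assumptions for illustration, not the BatchEnvPool implementation.

```python
def partition_env_indices(nbatch, nthread):
    """Split range(nbatch) into at most nthread contiguous, near-equal chunks."""
    if nbatch < 0 or nthread <= 0:
        raise ValueError("nbatch must be >= 0 and nthread must be > 0")
    base, extra = divmod(nbatch, nthread)
    chunks = []
    start = 0
    for worker in range(nthread):
        # The first `extra` workers take one additional environment.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than environments
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = partition_env_indices(nbatch=10, nthread=4)
print([list(c) for c in chunks])
```

Contiguous chunks keep each worker on neighboring environments; other policies, such as dynamic work stealing, would also fit the description in the text.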
The two input forms cover two kinds of randomization: reset-time field patches provide parameter-level randomization, and precompiled MjModel variants (link lengths, mesh scales, collision geometry) provide geometry-level randomization.
3.3 Execution Primitives
| Primitive | Input | Output / Purpose |
|---|---|---|
| step | (N, nstate), nstep, control | Final state (N, nstate); optional sensordata |
| forward | (N, nstate) | Sensordata (N, nsensordata) without advancing dynamics |
| reset | env_ids, states, randomization | Reset state/sensors for selected environments |
| compute_site_jacobians | state, site ids | Batched translational/rotational Jacobians |
| sample_hfield_height | state, geom id, XY offsets | Batched terrain heights or clearances |
Batched stepping. step(initial_state, nstep, control=None) runs mj_step for nstep steps on every environment. Controls have shape (N, nstep, ncontrol). With return_sensor=True, the final-step sensordata is also returned.
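The shape contract of final-state-only stepping can be sketched with a toy integrator. The code below is not MuJoCo physics and does not call BatchEnvPool: toy_step is a hypothetical stand-in that advances point masses with explicit Euler steps, chosen only to show that controls of shape (N, nstep, ncontrol) produce a single (N, nstate) result.

```python
import numpy as np


def toy_step(initial_state, nstep, control=None, dt=0.01):
    """Advance N point masses by nstep Euler steps and return only the final state.

    Each state row is [position, velocity] and each control row is a single
    acceleration. This is a stand-in for illustration, not MuJoCo dynamics.
    """
    state = np.array(initial_state, dtype=np.float64)  # copy, shape (N, 2)
    n = state.shape[0]
    if control is None:
        control = np.zeros((n, nstep, 1))
    if control.shape != (n, nstep, 1):
        raise ValueError("control must have shape (N, nstep, 1)")
    for t in range(nstep):
        state[:, 0] += dt * state[:, 1]        # position uses the old velocity
        state[:, 1] += dt * control[:, t, 0]   # velocity integrates the control
    return state  # shape (N, 2); intermediate steps are never stored


initial = np.zeros((4, 2))
controls = np.ones((4, 5, 1))
final = toy_step(initial, nstep=5, control=controls)
print(final.shape)
```

Because intermediate states are overwritten in place, memory use is independent of nstep, which is the property the final-state-only interface relies on.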
Forward evaluation. forward(initial_state) runs one mj_forward over all environments and returns sensors without advancing dynamics.
Sparse reset. reset(env_ids, initial_state, randomization=None) acts only on selected environments. Cost scales with the number of terminated environments.
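The indexing pattern behind sparse reset can be sketched with plain NumPy. The helper sparse_reset below is hypothetical and operates on a bare state array rather than on a pool; it shows only that rows outside env_ids are left untouched and that work is proportional to the number of selected environments.

```python
import numpy as np


def sparse_reset(states, env_ids, new_states):
    """Overwrite only the rows listed in env_ids; other rows are untouched."""
    env_ids = np.asarray(env_ids, dtype=np.int64)
    new_states = np.asarray(new_states, dtype=np.float64)
    if new_states.shape != (env_ids.size, states.shape[1]):
        raise ValueError("new_states must have shape (len(env_ids), nstate)")
    states[env_ids] = new_states
    return states


states = np.arange(12, dtype=np.float64).reshape(4, 3)
done = np.array([False, True, False, True])
env_ids = np.flatnonzero(done)  # indices of terminated environments
sparse_reset(states, env_ids, np.zeros((env_ids.size, 3)))
print(states)
```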
Site Jacobians. compute_site_jacobians computes jacp and/or jacr for one or more sites. Output shape: (N, K, 3, nv).
Height-field sampling. sample_hfield_height bilinearly samples a MuJoCo hfield geom. Output is terrain height or clearance.
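Bilinear sampling itself can be written in a few lines. The sketch below is a generic interpolation routine over a small grid in index units, with clamping at the borders; the function name, the clamping behavior, and the coordinate convention are assumptions and may differ from what sample_hfield_height does internally.

```python
def bilinear_sample(grid, x, y):
    """Bilinearly interpolate grid[row][col] at fractional (x=col, y=row).

    Coordinates are in grid-index units and are clamped to the grid extent.
    """
    nrow, ncol = len(grid), len(grid[0])
    x = min(max(x, 0.0), ncol - 1.0)
    y = min(max(y, 0.0), nrow - 1.0)
    c0, r0 = int(x), int(y)
    c1, r1 = min(c0 + 1, ncol - 1), min(r0 + 1, nrow - 1)
    tx, ty = x - c0, y - r0
    # Interpolate along columns on the two bracketing rows, then between rows.
    top = grid[r0][c0] * (1.0 - tx) + grid[r0][c1] * tx
    bottom = grid[r1][c0] * (1.0 - tx) + grid[r1][c1] * tx
    return top * (1.0 - ty) + bottom * ty


heights = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(heights, 0.5, 0.5)
print(center)
```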
3.4 Reset-Time Domain Randomization
The reset randomization payload is a dictionary that maps field names to float64 arrays. Fields that require a refresh trigger mj_setConst after patching, as listed below.
| Field | mj_setConst | Use case |
|---|---|---|
| body_mass | yes | Body mass and payload randomization |
| body_ipos | yes | Inertial-frame COM offsets |
| body_iquat | yes | Inertial-frame orientation perturbations |
| body_inertia | yes | Inertia tensor randomization |
| dof_armature | yes | Joint armature perturbations |
| gravity | no | Per-env gravity vectors |
| geom_friction | no | Contact friction randomization |
| kp, kd | no | Position-actuator gain randomization |
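The refresh rule in the table can be expressed as a small lookup. The sketch below transcribes the table into a dictionary and decides whether a payload would need an mj_setConst refresh. The helper name requires_set_const, the treatment of kp and kd as separate keys, and the array shapes in the example payloads are assumptions rather than the package's documented behavior.

```python
# Refresh requirement per field, transcribed from the table above.
# Treating "kp" and "kd" as separate payload keys is an assumption.
NEEDS_SET_CONST = {
    "body_mass": True,
    "body_ipos": True,
    "body_iquat": True,
    "body_inertia": True,
    "dof_armature": True,
    "gravity": False,
    "geom_friction": False,
    "kp": False,
    "kd": False,
}


def requires_set_const(payload):
    """Return True if any patched field needs an mj_setConst refresh."""
    unknown = sorted(set(payload) - set(NEEDS_SET_CONST))
    if unknown:
        raise KeyError(f"unregistered randomization fields: {unknown}")
    return any(NEEDS_SET_CONST[name] for name in payload)


# Hypothetical payloads for two reset environments; array shapes are assumed.
friction_only = {"geom_friction": [[1.0, 0.005, 0.0001]] * 2}
mass_and_gravity = {"body_mass": [[1.2], [0.9]],
                    "gravity": [[0.0, 0.0, -9.81]] * 2}
print(requires_set_const(friction_only), requires_set_const(mass_and_gravity))
```

Grouping patches this way means a payload that touches only gravity or friction can skip the refresh entirely.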
4. Validation and Benchmarks
This section reports MuJoCoUni benchmarks on four MuJoCo models compiled with the discardvisual compiler option. All data were collected on an Intel i9-14900HX running Ubuntu 20.04 with MuJoCoUni 3.8.0, Python 3.13, NumPy 2.4, and 16 simulation threads.
4.1 Step and Forward Throughput
The four models are Unitree Go1 (18 DoF), Wonik Allegro (16 DoF), Franka Panda (9 DoF), and CMU Humanoid (56 DoF). Throughput saturates at around 256–512 environments. At saturation, Allegro reaches ~1.8M steps/s, Go1 ~1.2M, Franka ~410k, and Humanoid ~290k.
4.2 Model-Variant Overhead
When each environment owns a distinct mjModel copy, cache locality decreases slightly. At saturation (256–512 environments) the gap closes and throughput is essentially identical.
4.3 Reset Performance
At 4096 environments, the C++ path completes a full reset in 3.5 ms, versus 53 ms for a Python loop, a ~15× speedup. The C++ path scales linearly with reset fraction.
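The reported timings and the linear-scaling statement can be combined into a rough estimate for partial resets. The sketch below uses only the two reported numbers; the helper estimated_reset_ms and its zero-fixed-overhead assumption are illustrative, so the result is an extrapolation rather than a measurement.

```python
FULL_RESET_MS = 3.5      # reported C++ time for resetting all 4096 environments
PYTHON_LOOP_MS = 53.0    # reported Python-loop time for the same full reset


def estimated_reset_ms(fraction, full_reset_ms=FULL_RESET_MS):
    """Linear estimate of reset time; assumes zero fixed overhead."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return fraction * full_reset_ms


speedup = PYTHON_LOOP_MS / FULL_RESET_MS
ten_percent = estimated_reset_ms(0.10)
print(round(speedup, 1), round(ten_percent, 2))
```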
4.4 Batched Jacobian Performance
The C++ pool computes Jacobians for 4096 environments in 0.53 ms, versus 11.9 ms for a Python loop, a ~22× speedup.
4.5 Height-Field Sampling Performance
At 4096 environments, the C++ path takes 0.52 ms, versus 290 ms for a Python loop, a ~555× speedup.
5. Applications
5.1 Robot Reinforcement Learning
Robot RL is the primary target workload. BatchEnvPool gathers short-horizon stepping, sensor reads, sparse reset, reset-time domain randomization, and current-state queries into one MuJoCo-side object. Downstream systems can consume final states through synchronous batch sampling, asynchronous collection, or offline data generation.
5.2 Sim-to-Real Domain Randomization
MuJoCoUni places common MuJoCo field patches and required mj_setConst refreshes inside reset; geometry-level changes are represented by precompiled model variants at construction time.
5.3 Terrain-Aware Locomotion
sample_hfield_height samples MuJoCo hfield data in batch, supports yaw/world/body alignment, and returns either terrain height or frame clearance.
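One reading of yaw alignment is that local scan offsets are rotated by the robot's heading before being added to its base position. The sketch below implements that reading in plain Python; the function name yaw_aligned_points and the convention itself are assumptions, since the report does not spell out how the alignment modes are defined.

```python
import math


def yaw_aligned_points(base_xy, yaw, offsets):
    """Rotate local XY offsets by yaw (radians) and translate by base_xy."""
    c, s = math.cos(yaw), math.sin(yaw)
    bx, by = base_xy
    return [(bx + c * ox - s * oy, by + s * ox + c * oy) for ox, oy in offsets]


# A two-point scan ahead of a robot at (1, 2) that faces +Y (yaw = 90 degrees).
points = yaw_aligned_points((1.0, 2.0), math.pi / 2, [(0.5, 0.0), (1.0, 0.0)])
print(points)
```

With this convention a scan pattern defined once in the robot's heading frame follows the robot as it turns, while roll and pitch are ignored.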
5.4 Manipulation and Kinematic Control
compute_site_jacobians runs the minimal kinematic prefix over the full pool and calls mj_jacSite in batch, supporting operational-space control, reward computation, constraint checks, and IK auxiliary objectives.
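As an example of consuming batched Jacobians, the sketch below maps generalized velocities to site linear velocities with a single einsum. It uses synthetic arrays in place of pool output; only the (N, K, 3, nv) shape is taken from Section 3.3, and the helper site_linear_velocity is hypothetical.

```python
import numpy as np


def site_linear_velocity(jacp, qvel):
    """Batched v = J_p @ qvel.

    jacp: (N, K, 3, nv) translational Jacobians.
    qvel: (N, nv) generalized velocities.
    Returns (N, K, 3) linear velocities, one per site and environment.
    """
    return np.einsum("nkij,nj->nki", jacp, qvel)


# Synthetic inputs stand in for pool output: N=2 envs, K=1 site, nv=3 DoFs.
jacp = np.zeros((2, 1, 3, 3))
jacp[:, 0] = np.eye(3)  # identity Jacobian for easy checking
qvel = np.array([[1.0, 2.0, 3.0],
                 [0.0, -1.0, 0.5]])
vel = site_linear_velocity(jacp, qvel)
print(vel.shape)
```

The same contraction pattern applies to rotational Jacobians, and its transpose maps task-space forces to joint torques in operational-space controllers.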
5.5 Batch Optimization
MuJoCoUni's persistent model pools, model-variant initialization, and final-state-only step fit evaluation loops whose objective depends on final state, terminal events, or aggregated rewards.
6. Discussion
6.1 Runtime Boundary and Tradeoffs
Per-environment model copies increase memory use, geometry-level randomization requires precompiled compatible models, and reset-time field patching covers the currently registered field set. The corresponding benefits are clear object ownership, lower-frequency Python interaction, and simulator-side interfaces embeddable in different systems.
6.2 System Context
GPU-resident simulation and CUDA stacks (Isaac Gym, Isaac Lab, MuJoCo Playground, Genesis) demonstrate large-scale GPU-parallel training efficiency. CPU MuJoCo preserves mature XML/MJB assets, sensors, debugging, and visualization workflows. For workloads needing full MuJoCo feature coverage, cross-platform deployment, or reuse of existing assets, MuJoCoUni provides a concrete CPU-batched engineering path.
6.3 Availability
MuJoCoUni is released as the open-source mujoco-uni Python package with unit tests and parity checks. The benchmark code is at github.com/unilabsim/mujoco_uni_bench.
References
- [1] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. IROS, 2012.
- [2] Jiayi Weng et al. EnvPool: A highly parallel reinforcement learning environment execution engine. NeurIPS, 2022.
- [3] Stone Tao et al. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. RSS, 2025.
- [4] Yufei Jia et al. GS-Playground: A high-throughput photorealistic simulator for vision-informed robot learning. arXiv:2604.25459, 2026.
- [5] C. Daniel Freeman et al. Brax — A differentiable physics engine for large scale rigid body simulation. arXiv:2106.13281, 2021.
- [6] MuJoCo XLA Authors. MuJoCo XLA (MJX), 2024.
- [7] Viktor Makoviychuk et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv:2108.10470, 2021.
- [8] Mayank Mittal et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv:2511.04831, 2025.
- [9] Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, 2024.
- [10] Google DeepMind and NVIDIA. MuJoCo Warp: GPU-optimized MuJoCo, 2025.
- [11] Rustam Eynaliyev and Houcen Liu. Combining GPU and CPU for accelerating evolutionary computing workloads. arXiv:2502.11129, 2025.
- [12] Kevin Zakka et al. MuJoCo Playground. arXiv:2502.08844, 2025.