Runtime Model
The detailed runtime contract is in ADR-0001 Runtime Model And Layer Boundaries and the Developer Guide. This page keeps the English summary close to the code paths.
Two Runtime Shapes
Synchronous PPO Paths
scripts/train_rsl_rl.py and scripts/train_mlx_ppo.py compose Hydra config,
call registry bootstrap, construct the env through registry.make(...), and run
the learner in the same process. The RSL-RL path adapts NpEnv through
src/unilab/training/rsl_rl.py; the MLX path uses
src/unilab/algos/mlx/ppo/runner.py and src/unilab/algos/mlx/ppo/ppo.py.
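
The synchronous shape described above can be sketched as follows. This is a minimal illustration, not the project's code: `register`, `make`, `bootstrap`, `ToyVecEnv`, and `train_sync` are hypothetical stand-ins, the config is a plain dict rather than a composed Hydra config, and the "policy" is a placeholder. The point is only that env construction, rollout collection, and the learner update all happen in one process.

```python
import numpy as np

_REGISTRY = {}  # hypothetical stand-in for the project's env registry


def register(name, factory):
    _REGISTRY[name] = factory


def make(name, **kwargs):
    # Look up a factory by name and build the env, registry.make-style.
    return _REGISTRY[name](**kwargs)


class ToyVecEnv:
    """Vectorized numpy env: every array has a leading num_envs axis."""

    def __init__(self, num_envs, obs_dim=3, seed=0):
        self.num_envs = num_envs
        self.obs_dim = obs_dim
        self.rng = np.random.default_rng(seed)
        self.obs = np.zeros((num_envs, obs_dim), dtype=np.float32)

    def reset(self):
        shape = (self.num_envs, self.obs_dim)
        self.obs = self.rng.standard_normal(shape).astype(np.float32)
        return self.obs

    def step(self, actions):
        self.obs = (self.obs + 0.1 * actions).astype(np.float32)
        rewards = -np.linalg.norm(self.obs, axis=1)
        return self.obs, rewards


def bootstrap():
    # Assumed equivalent of "registry bootstrap": make env names resolvable.
    register("toy", ToyVecEnv)


def train_sync(cfg):
    bootstrap()
    env = make(cfg["env"], num_envs=cfg["num_envs"])
    obs = env.reset()
    total_steps = 0
    for _ in range(cfg["iterations"]):
        rollout = []
        for _ in range(cfg["horizon"]):
            actions = -obs  # placeholder policy, not PPO
            obs, rewards = env.step(actions)
            rollout.append(rewards)
            total_steps += env.num_envs
        # A real learner update would consume `rollout` here, in-process.
    return total_steps
```

Calling `train_sync({"env": "toy", "num_envs": 4, "iterations": 2, "horizon": 5})` steps four envs for ten steps in total, with no second process or IPC buffer involved.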
Async APPO And Off-Policy Paths
APPO and off-policy runners use a CPU-sim-to-learner split:
CPU physics env loop -> shared IPC buffer -> learner
        ^                                      |
        +------------- SharedWeightSync -------+
- APPO uses APPORunner, RolloutRingBuffer, and SharedWeightSync.
- SAC, TD3, and FlashSAC use off-policy runners with ReplayBuffer and SharedWeightSync.
- AsyncRunner in src/unilab/ipc/async_runner.py owns collector process startup, stop signaling, and shared-resource cleanup.
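
The collector/learner split above can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `ToyWeightSync`, `collector`, and `run_async` are hypothetical names, a thread stands in for the collector process, and a bounded `queue.Queue` stands in for the shared IPC buffer. The real primitives use separate processes and shared resources, and their interfaces are not reproduced here.

```python
import queue
import threading

import numpy as np


class ToyWeightSync:
    """Versioned weight slot: the learner publishes, the collector polls."""

    def __init__(self, size):
        self._lock = threading.Lock()
        self._weights = np.zeros(size, dtype=np.float32)
        self._version = 0

    def publish(self, weights):
        with self._lock:
            self._weights[:] = weights
            self._version += 1

    def read(self):
        with self._lock:
            return self._weights.copy(), self._version


def collector(buf, sync, stop, num_envs=4, obs_dim=3):
    # Env loop: read the latest weights, act, push a transition batch.
    rng = np.random.default_rng(0)
    while not stop.is_set():
        weights, version = sync.read()
        obs = rng.standard_normal((num_envs, obs_dim)).astype(np.float32)
        actions = obs * weights
        try:
            buf.put((obs, actions, version), timeout=0.05)
        except queue.Full:
            continue  # buffer full: re-check the stop flag and retry


def run_async(num_updates=5, obs_dim=3):
    buf = queue.Queue(maxsize=8)
    sync = ToyWeightSync(obs_dim)
    stop = threading.Event()
    worker = threading.Thread(
        target=collector, args=(buf, sync, stop), kwargs={"obs_dim": obs_dim}
    )
    worker.start()  # startup is owned by the runner
    weights = np.zeros(obs_dim, dtype=np.float32)
    versions_seen = []
    try:
        for _ in range(num_updates):
            obs, actions, version = buf.get(timeout=5.0)
            versions_seen.append(version)
            weights = weights + 0.01 * obs.mean(axis=0)  # placeholder update
            sync.publish(weights)
    finally:
        stop.set()  # stop signaling
        worker.join(timeout=5.0)  # cleanup
    return sync.read()[1], versions_seen, worker.is_alive()
```

The version tag attached to each batch shows why data in the buffer can lag the learner: a batch may have been collected under older weights than the ones the learner currently holds.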
Boundary Rules
- The env remains numpy/vectorized and returns NpEnvState.
- GPU tensors and optimizer state belong to learner code, not env code.
- Collector/learner protocols must reuse the existing IPC primitives instead of creating ad-hoc parallel protocols in scripts.
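
One way to make the first two rules checkable is a small guard at the env/learner boundary. The sketch below is hypothetical: `ToyEnvState` and its fields are invented for illustration and do not reflect the actual layout of NpEnvState, and `check_env_boundary` is not a function in the repo. It only shows the idea that everything leaving the env should be a numpy array with a leading num_envs axis.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class ToyEnvState:
    # Hypothetical fields; the real NpEnvState may differ.
    obs: np.ndarray
    reward: np.ndarray
    done: np.ndarray


def check_env_boundary(state, num_envs):
    """Raise if any field is not a numpy array batched over num_envs."""
    for f in fields(state):
        value = getattr(state, f.name)
        if not isinstance(value, np.ndarray):
            raise TypeError(
                f"{f.name} must be a numpy array, got {type(value).__name__}"
            )
        if value.ndim == 0 or value.shape[0] != num_envs:
            raise ValueError(
                f"{f.name} must have leading axis {num_envs}, got {value.shape}"
            )
    return state
```

A learner-side adapter would call such a guard once per step and only then convert the arrays to its own tensor type, which keeps device placement and optimizer state out of env code.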
Evidence In Repo
- PPO entrypoints: scripts/train_rsl_rl.py, scripts/train_mlx_ppo.py
- APPO runner: src/unilab/algos/torch/appo/runner.py
- Off-policy runner: src/unilab/algos/torch/offpolicy/runner.py
- IPC primitives: src/unilab/ipc/async_runner.py, src/unilab/ipc/rollout_ring_buffer.py, src/unilab/ipc/replay_buffer.py, src/unilab/ipc/weight_sync.py